We run the requested commands
library(MASS)
data(cats)
head(cats)
## Sex Bwt Hwt
## 1 F 2.0 7.0
## 2 F 2.0 7.4
## 3 F 2.0 9.5
## 4 F 2.1 7.2
## 5 F 2.1 7.3
## 6 F 2.1 7.6
dim(cats)
## [1] 144 3
summary(cats)
## Sex Bwt Hwt
## F:47 Min. :2.000 Min. : 6.30
## M:97 1st Qu.:2.300 1st Qu.: 8.95
## Median :2.700 Median :10.10
## Mean :2.724 Mean :10.63
## 3rd Qu.:3.025 3rd Qu.:12.12
## Max. :3.900 Max. :20.50
Note that we may also load cats as a dataset in R in its own right, rather than referring to it as part of the MASS package, by declaring;
cats <- cats
Note that the cats dataset involves different units. Specifically, Bwt is in kilograms while Hwt is in grams. To compute the ratio of Hwt to Bwt, and then convert to percent, we must have them in the same unit. Here we work in kg, so we multiply Hwt by \(1=\frac{1\,kg}{1000\,g},\) then take the ratio Hwt/Bwt, and finally multiply by \(100\) to get the result in percent. The resulting transformation, followed by a fresh look at head(cats)
cats <- transform(cats, pct = (cats$Hwt/1000)*(1/cats$Bwt)*100)
head(cats)
## Sex Bwt Hwt pct
## 1 F 2.0 7.0 0.3500000
## 2 F 2.0 7.4 0.3700000
## 3 F 2.0 9.5 0.4750000
## 4 F 2.1 7.2 0.3428571
## 5 F 2.1 7.3 0.3476190
## 6 F 2.1 7.6 0.3619048
Thus, for example, row two tells us that the heart weight of cat number two constitutes a mere \(0.37\%\) of the cat's body weight - surprisingly little.
As requested we may subset cats with the subset function, remembering that any gentleman always defines femaleData first;
femaleData <- subset(cats, cats$Sex == "F")
maleData <- subset(cats, cats$Sex == "M")
head(femaleData)
## Sex Bwt Hwt pct
## 1 F 2.0 7.0 0.3500000
## 2 F 2.0 7.4 0.3700000
## 3 F 2.0 9.5 0.4750000
## 4 F 2.1 7.2 0.3428571
## 5 F 2.1 7.3 0.3476190
## 6 F 2.1 7.6 0.3619048
head(maleData)
## Sex Bwt Hwt pct
## 48 M 2.0 6.5 0.3250000
## 49 M 2.0 6.5 0.3250000
## 50 M 2.1 10.1 0.4809524
## 51 M 2.2 7.2 0.3272727
## 52 M 2.2 7.6 0.3454545
## 53 M 2.2 7.9 0.3590909
We will be testing a linear regression model on the data prepared in HS1. Starting off with a (Bwt,Hwt) scatterplot, note once again that Hwt is in units of grams, with Bwt being in kilograms.
plot(cats$Bwt,cats$Hwt, main = "(Bwt, Hwt) scatterplot", xlab = "Cat bodyweight in kg", ylab = "Cat heartweight in g")
The statistical model behind the linear regression may be recited from Definition 6.1 of “Introduktion til Statistik.”
We assume \(Y_1,\ldots,Y_n\) independent with \(Y_i\sim\mathcal{N}(\alpha+\beta x_i,\sigma^2),\,\)where \(\alpha,\beta\in\mathbb{R},\,\sigma^2>0\) are unknown parameters.
Finally, for maleData, we will use the lm function in R to fit the linear regression model, assuming heartweight to be a linear function of bodyweight, such that we may write Hwt \(= \alpha + \beta\,\cdot\)Bwt.
linreg <- lm(Hwt ~ Bwt, data = maleData)
Further following the procedures of Chapter 6 (in particular page 133) in “Introduktion til Statistik,” we may carry out a visual model validation of our linear regression model by computing the fitted values of the model and the standardized residuals of the data. Respectively, this is done with the assignments
fit <- fitted(linreg)
rst <- rstandard(linreg)
such that we might plot the fitted values against the standardized residuals, adding a horizontal line through zero also
plot(fit, rst, main = "(Estimate, Std. Res.)-plot for linreg model on maleData", xlab = "Estimated (fitted) Heartweight (g)", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) ) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
abline(0,0)
Looking at the above plot, we note that if the assumption that the variance of the heartweight data is independent of the mean holds, we would expect the dispersion of points in any region of fitted heartweight values to be akin to the dispersion in any other region. If the linearity assumption of our model holds, we would expect the points to be dispersed roughly as \(\mathcal{N}(0,1)\) on both sides of \(0\) across the fitted values - i.e. that the standardized residuals are themselves approximately \(\mathcal{N}(0,1)\) distributed.
The plot looks rather respectable on both these fronts, as 1. the residual values fall within the interval \(\left({-2,2}\right),\) as would be expected of a standard normal distribution, and 2. no weird “trumpet,” “quadratic” or other distortions of the standardized residuals are present; they remain rather constantly dispersed around zero within any fitted heartweight region.
Note that the values of
mean(rst)
## [1] 0.0004535485
var(rst)
## [1] 1.015539
are in line with this expectation (mean close to \(0\), variance close to \(1\)).
Finally, the assumption that the heartweight data is normally distributed may be visually assessed through a QQ-plot of the standardized residuals. If the normality assumption of the model holds, we would expect the standardized residuals to be \(\mathcal{N}(0,1)\) distributed, which would reveal itself in the theoretical \(\mathcal{N}(0,1)\) quantiles matching the empirical quantiles of the standardized residuals, such that the points of the QQ-plot hug the line through \(0\) with slope equal to the standard deviation of an \(\mathcal{N}(0,1)\) distributed variable, i.e. with slope \(\sqrt{1}=1\). The plot and the corresponding line through \(0\) with slope \(1\) can be made with the commands below
qqnorm(rst)
abline(0,1)
revealing that no sufficiently large abnormalities seem to be present that would contradict our assumption that the standardized residuals are \(\mathcal{N}(0,1)\) distributed, and thus that the data follows a normal distribution. Do observe, however, that making a QQ-plot of the data itself will not lead to much, as the means are assumed to vary according to the bodyweight (unless of course \(\beta=0\)).
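As a small supplement of our own (not part of the exercise), the visual QQ-assessment can be paired with a formal normality test; the Shapiro-Wilk test from base R is one option:

```r
# Shapiro-Wilk test of normality applied to the standardized residuals
# (our own addition; the exercise itself only asks for the visual check).
shapiro.test(rst)
# A large p-value is consistent with (though does not prove) normality.
```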
All in all, we say that it seems fairly reasonable to assume that the data is distributed according to the above written model.
A valuable way to get insight into the linear regression model is to deploy the summary function.
summary(linreg)
##
## Call:
## lm(formula = Hwt ~ Bwt, data = maleData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7728 -1.0478 -0.2976 0.9835 4.8646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1841 0.9983 -1.186 0.239
## Bwt 4.3127 0.3399 12.688 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.557 on 95 degrees of freedom
## Multiple R-squared: 0.6289, Adjusted R-squared: 0.625
## F-statistic: 161 on 1 and 95 DF, p-value: < 2.2e-16
which reveals quite a lot of detail about the fitted linear model. Information on how to interpret the output of summary may be found through the ?lm command.
Another valuable trick is that it is possible to extract the slope (\(\beta\)) and intercept (\(\alpha\)) values of the linear regression, and use these to define local variables beta and alpha
alpha <- linreg$coefficients[[1]]
## [1] -1.184088
beta <- linreg$coefficients[[2]]
## [1] 4.312679
Noting the use of [[ ]] to extract the numerical value instead of the named value that results from using
linreg$coefficients[1]
## (Intercept)
## -1.184088
linreg$coefficients[2]
## Bwt
## 4.312679
Alternatively the format,
as.numeric(linreg$coefficients[1])
## [1] -1.184088
as.numeric(linreg$coefficients[2])
## [1] 4.312679
evidently works as well.
We have thus established the model maleData$Hwt = -1.1840879 + 4.3126787 · maleData$Bwt. As the rate of change of an affine function is constant, we may simply note that an increase in bodyweight of \(0.5\,kg\) will, by the model, correspond to a 0.5 · beta = 2.1563394 gram increase in heartweight.
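As a sketch of the same point (our own illustration), this \(0.5\,kg\) rate of change can also be read off with predict(), comparing fitted heartweights at two bodyweights \(0.5\,kg\) apart:

```r
# Fitted heartweights at two illustrative bodyweights, 3.0 and 3.5 kg.
nd <- data.frame(Bwt = c(3.0, 3.5))
diff(predict(linreg, newdata = nd))  # equals 0.5 * beta, about 2.156 g
```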
Note that the model is rather imperfect: for example, for maleData$Bwt \(\searrow 0\,kg\) it estimates maleData$Hwt \(\searrow -1.1840879\,g\). And while this particular shortcoming is of course in itself preposterous, we may note that if we wanted to force R to fit the best linear model to the data assuming an intercept of zero, we might have done this through the use of
linreg <- lm(Hwt ~ Bwt - 1, data = maleData)
i.e. with the inclusion of “-1” being the symbolic manipulation required to tell R to model under the assumption of an intercept of \(0.\)
Let us just remind ourselves of the standard frequentistic interpretation ascribed to confidence intervals of level \(1-\alpha^*\) for the parameter \(\beta\):
Imagine the experiment being conducted multiple times in the same way, but independent of the original and each other - in our case a new batch of cats being weighed, their sex noted, and their heartweight noted also. Using the dataset corresponding to each repetition of experiment we calculate the \(1-\alpha^*\) confidence intervals for \(\beta\) (as will be done below). We would then expect the true value of \(\beta\) to be present in approximately ratio \(1-\alpha^*\) of all the intervals calculated.
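This frequentist interpretation can be illustrated with a small simulation sketch of our own (the “true” parameter values below are chosen purely for illustration, and we reuse the observed bodyweights as fixed covariates):

```r
# Coverage sketch: simulate repeated experiments from a known linear model,
# compute the 95% CI for beta each time, and record how often it covers the
# true beta.
set.seed(314)
trueAlpha <- -1; trueBeta <- 4; trueSigma <- 1.5  # illustrative values
x <- maleData$Bwt
covered <- replicate(2000, {
  y <- trueAlpha + trueBeta * x + rnorm(length(x), 0, trueSigma)
  ci <- confint(lm(y ~ x))[2, ]
  ci[1] <= trueBeta && trueBeta <= ci[2]
})
mean(covered)  # expected to land near 0.95
```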
From Theorem 6.9 in “Introduktion til Statistik” we have that a \(1-\alpha^*\) confidence interval for \(\beta\) is given by the interval bounded by \[\hat{\beta}\pm t_{n-2,1-\alpha^*/2}\,SE(\hat\beta).\] Note that we may find the standard error of \(\hat\beta,\,SE(\hat\beta),\) in summary(linreg) as the “Std. Error” for Bwt, which is 0.3399 in our instance. Extraction of this value may occur via
seb <- sqrt(diag(vcov(linreg)))[[2]]
## [1] 0.3398942
Having taken a look at ?qt as well as an extra look at Theorem 6.9, we may, in order to get a \(95\%\) confidence interval, do
n <- nrow(maleData)
## [1] 97
lKI <- beta - qt(0.975, n-2)*seb
uKI <- beta + qt(0.975, n-2)*seb
KI <- c(lKI, uKI)
## [1] 3.637904 4.987454
i.e. we get the \(95\%\) confidence interval \((3.637904,\,4.987454)\).
Having browsed ?confint we may recover the results of the previous subproblem with confint(linreg) as follows;
confint(linreg)
## 2.5 % 97.5 %
## (Intercept) -3.165939 0.7977636
## Bwt 3.637904 4.9874540
#We try to extract the interval;
confint(linreg)[2,]
## 2.5 % 97.5 %
## 3.637904 4.987454
as.numeric(confint(linreg)[2,]) #KI
## [1] 3.637904 4.987454
confint(linreg)[2,1] #lKI
## [1] 3.637904
confint(linreg)[2,2] #uKI
## [1] 4.987454
For \(x_i\) being the bodyweight data and \(Y_i\) the random variable representing the heartweight, both of the \(i\)’th male cat, we will by assumption have \(E(Y_i)=\alpha + \beta x_i.\) We seek a design matrix \(A\) such that we may write \(\xi:=\left({E(Y_1), E(Y_2), \ldots, E(Y_{97})}\right)^T\) as \[ \xi=A\begin{pmatrix} \alpha\\ \beta \end{pmatrix}. \] Consequently, as \(\xi\in M_{97,1}(\mathbb{R}),\,\left({\alpha, \beta}\right)^T\in M_{2,1}(\mathbb{R}),\) we will by matrix multiplication need to have \(A\in M_{97,2}(\mathbb{R}),\) and in particular \[\begin{align*} A\begin{pmatrix} \alpha\\ \beta \end{pmatrix} \overset{\swarrow}{\equiv} \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \\ \vdots & \vdots \\ a_{97,1} & a_{97,2} \end{pmatrix}\begin{pmatrix} \alpha\\ \beta \end{pmatrix}&\overset{\swarrow}{=} \begin{pmatrix} a_{1,1}\alpha + a_{1,2}\beta \\ a_{2,1}\alpha + a_{2,2}\beta \\ \vdots \\ a_{97,1}\alpha+a_{97,2}\beta \end{pmatrix}\overset{\searrow}{=}\begin{pmatrix} \alpha + \beta x_1 \\ \alpha + \beta x_2 \\ \vdots \\ \alpha + \beta x_{97} \\ \end{pmatrix}\overset{\searrow}{\equiv}\xi\\ &\Leftrightarrow\\ a_{i,1}\equiv 1&,\,\,a_{i,2}=x_i\\ &\Leftrightarrow\\ A&=\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{97} \end{pmatrix} \end{align*}\]
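As a quick check of our own, R can produce this design matrix directly via model.matrix():

```r
# The design matrix R uses for lm(Hwt ~ Bwt): an intercept column of 1s
# followed by the bodyweights x_i, matching the derived A.
Adesign <- model.matrix(~ Bwt, data = maleData)
head(Adesign)
dim(Adesign)  # 97 x 2
```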
We may recite the (normal) statistical model of two independent samples as given in D. 5.1 of “Introduktion til Statistik”;
Let \(X_1,\ldots,X_{n_1}\) and \(Y_1,\ldots,Y_{n_2}\) be independent normally distributed random variables with \(X_i\sim\mathcal{N}(\mu_1,\sigma^2)\) and \(Y_j\sim\mathcal{N}(\mu_2,\sigma^2),\) with \(\mu_1,\mu_2\in\mathbb{R}\) and \(\sigma^2>0\) being unknown parameters.
We calculate estimates of the mean and variance for each variable in the two groups below
mean(femaleData$Bwt) #2.359574
mean(femaleData$Hwt) #9.202128
mean(femaleData$pct) #0.3915119
#--
mean(maleData$Bwt) #2.9
mean(maleData$Hwt) #11.32268
mean(maleData$pct) #0.3894547
#-
var(femaleData$Bwt) #0.07506938
var(femaleData$Hwt) #1.843256
var(femaleData$pct) #0.002630804
#--
var(maleData$Bwt) #0.2185417
var(maleData$Hwt) #6.46323
var(maleData$pct) #0.002858461
Note, though, that for our particular analysis we happen to be interested in the variables femaleData$pct and maleData$pct, each a sample drawn from the full cats data, thus making two samples.
Note that our model has three primary assumptions:
1. The observations in cats$pct are independent.
2. The variance of femaleData$pct equals the variance of maleData$pct (variance homogeneity).
3. The observations are normally distributed.
The independence assumption will usually follow from the method by which the experiment is conducted and the data is collected.
Observe from the above calculations in particular that
mean(femaleData$pct) #0.3915119
mean(maleData$pct) #0.3894547
var(femaleData$pct) #0.002630804
var(maleData$pct) #0.002858461
the latter pair fitting rather well with our assumption of variance homogeneity between femaleData$pct and maleData$pct - more precisely, we may say that on the basis of the data, we do not yet see any reason to doubt variance homogeneity as an assumption.
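To supplement the eyeballing with a formal check (our own addition, not requested by the exercise), base R's var.test() performs an F-test for equality of two variances:

```r
# F-test of H0: equal variances of pct in the two groups. A large p-value
# gives no reason to doubt the variance-homogeneity assumption.
var.test(femaleData$pct, maleData$pct)
```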
Note that we may define the following function, which draws a histogram and overlays a normal density estimate, and then use it to make such a histogram for each of femaleData$pct and maleData$pct
NHistDensityDraw<-function(varname, xlab = deparse(substitute(varname)), main = paste("Histogram of", deparse(substitute(varname))), ...) {
#Draws normaldistribution density on top of a histogram.
seqT<-seq(min(varname),max(varname), by = 1/(5*10^4*(max(varname)-min(varname)))) #Counts from minimum of data to maximum of data via the by mechanism
f1T<-dnorm(seqT,mean=mean(varname),sd=sd(varname)) #creating density
maxf1T <- max(f1T)
hist(varname, prob=1, ylim=c(0,maxf1T), xlab = xlab, main = main, ...) #Drawing histogram, making sure we get the entire height of the density estimate
lines(seqT,f1T) #Drawing density
}
NHistDensityDraw(femaleData$pct)
NHistDensityDraw(maleData$pct)
which also does not provide cause for concern about our model assumptions. Getting to the meat of the model control, we may introduce QQ-plots for the two datasets
qqnorm(femaleData$pct, main = "Female QQ Plot")
abline(mean(femaleData$pct),sd(femaleData$pct))
qqnorm(maleData$pct, main = "Male QQ Plot")
abline(mean(maleData$pct),sd(maleData$pct))
Note that in both plots the data seems to hug the drawn line rather well, thus providing no immediate refutation of femaleData$pct or maleData$pct being normally distributed.
Looking through ?t.test we may note that a \(95\%\) confidence interval for the expected difference between femaleData$pct and maleData$pct may be obtained through
t.test(femaleData$pct, maleData$pct, paired = F, var.equal = T)
##
## Two Sample t-test
##
## data: femaleData$pct and maleData$pct
## t = 0.21935, df = 142, p-value = 0.8267
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01648249 0.02059684
## sample estimates:
## mean of x mean of y
## 0.3915119 0.3894547
noting that we have specified that we are not dealing with a paired two-sample setup - we assume independence between the female and male groups. One place we do find a paired analysis is in drug trials with the groupings “before taking drug” and “after taking drug”: as we sample twice from each person, a rather stark dependence between the samples occurs, as people are rather non-independent of themselves.
We also include var.equal = T, thus telling R that we assume variance homogeneity between the two samples.
Note that we may extract the confidence interval via
KI <- as.numeric(t.test(femaleData$pct, maleData$pct, paired = F, var.equal = T)[[4]])
KI
## [1] -0.01648249 0.02059684
lKI <- KI[1]
uKI <- KI[2]
Note that as \(0\) is firmly placed inside the confidence interval, there is not immediately any evidence of a percentagewise difference in heartweight/bodyweight between male and female cats.
Having looked at ?confint we may use the requested commands;
fit <- lm(pct ~ Sex, data = cats)
summary(fit)
##
## Call:
## lm(formula = pct ~ Sex, data = cats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.129012 -0.038876 -0.003071 0.034442 0.136186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.391512 0.007697 50.863 <2e-16 ***
## SexM -0.002057 0.009379 -0.219 0.827
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05277 on 142 degrees of freedom
## Multiple R-squared: 0.0003387, Adjusted R-squared: -0.006701
## F-statistic: 0.04811 on 1 and 142 DF, p-value: 0.8267
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 0.37629566 0.40672807
## SexM -0.02059684 0.01648249
We may then for example extract the residual standard error through
rstderr <- sqrt(deviance(fit)/df.residual(fit))
## [1] 0.05277038
rstderr^2
## [1] 0.002784713
For \(Z_1,\ldots,Z_{144}\) being the random variables representing pct for the 144 cats, we have assumed \(E(Z_i)=\mu_1,\,E(Z_j)=\mu_2\) for \(i=1,\ldots,97,\,j=98,\ldots,144,\) respectively representing the male and female cats. We may proceed as in HS2.6 as we endeavour to write a design matrix \(C\) such that for \(\eta:=\left({E(Z_1),\ldots,E(Z_{144})}\right)^T\equiv\left({\mu_1,\ldots,\mu_1,\mu_2,\ldots,\mu_2}\right)^T\)
\[ \eta=C\begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}. \] Consequently, as \(\eta\in M_{144,1}(\mathbb{R}),\,\left({\mu_1, \mu_2}\right)^T\in M_{2,1}(\mathbb{R}),\) we will by matrix multiplication need to have \(C\in M_{144,2}(\mathbb{R}),\) and in particular for \[\begin{align*} C\begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix} \overset{\swarrow}{\equiv} \begin{pmatrix} c_{1,1} & c_{1,2} \\ c_{2,1} & c_{2,2} \\ \vdots & \vdots \\ c_{144,1} & c_{144,2} \end{pmatrix}\begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}&\overset{\swarrow}{=} \begin{pmatrix} c_{1,1}\mu_1 + c_{1,2}\mu_2 \\ c_{2,1}\mu_1 + c_{2,2}\mu_2 \\ \vdots \\ c_{144,1}\mu_1 + c_{144,2}\mu_2 \end{pmatrix}\overset{\searrow}{=}\begin{pmatrix} \mu_1\\ \vdots\\ \mu_1\\ \mu_2\\ \vdots\\ \mu_2 \end{pmatrix}\overset{\searrow}{\equiv}\eta\\ &\Leftrightarrow\\ c_{i,1}\equiv 1_{\left\{{1,\ldots,97}\right\}}(i)&,\,\,c_{i,2}=1_{\left\{{98,\ldots,144}\right\}}(i)\\ &\Leftrightarrow\\ C&=\begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix} \end{align*}\]
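As with the design matrix in HS2.6, R can produce a corresponding 0/1 indicator matrix itself; a sketch of our own (note that in the actual cats data the female rows come first, so the blocks appear in the opposite order of the derivation above):

```r
# Cell-means design matrix for the two-group model: one indicator column
# per sex, 144 rows in total.
Cdesign <- model.matrix(~ Sex - 1, data = cats)
head(Cdesign)  # columns SexF and SexM
dim(Cdesign)   # 144 x 2
```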
Note that for \(X_1,\ldots,X_{10} \sim\mathcal{N}(0,1)\) iid, we may, for example by NRH problem 1.1, convolution results, or C9.46 in EH (MatStat Bind 2), determine the desired distribution. Define, in accordance with C9.46, \(C:=\frac{1}{n}\left({1,\ldots,1}\right)\in\mathbb{R}^{1\times10}\) and \(X:=\left({X_1,\ldots,X_{10}}\right)^T,\) so that \(\overline{X}\equiv\frac{1}{n}\sum_{i=1}^{n}{X_i}\equiv CX \sim \mathcal{N}_1(C\cdot0,CIC^T)=\mathcal{N}_1(0,\frac{1}{n^2}n)=\mathcal{N_1}(0,\frac{1}{n}),\) which is also in accordance with the result reached in NRH1.1.
We may test this result through gent simulations of \(10\) outcomes of \(\mathcal{N}(0,1)\) each, which we then average over;
n<-10
gent <- 5*10^3
gns <- rep(NA,gent)
for (i in 1:gent) {
X<-rnorm(n,0,1)
gns[i] = 1/n*sum(X)
}
hist(gns, prob = T, ylim = c(0.0,1.3))
f<- function(x) dnorm(x,mean=0,sd=1/sqrt(n))
plot(f,-1,1,add=T)
Note also that we might create a QQ-plot for the data;
qqnorm(gns)
abline(0,1/sqrt(n))
for which we see that by Introduktion til Statistik p. 68, everything is as it should be.
For each outcome of \(X_1,\ldots,X_{10}\) as \(x_1,\ldots,x_{10},\) denote by \(x_{(1)}\leq x_{(2)}\leq\ldots\leq x_{(10)}\) the ordering of the outcomes. As such, the median \(\widehat{X}\) of \(X_1,\ldots,X_{10}\) will equal \(\frac{x_{(5)}+x_{(6)}}{2},\) and can be calculated with the median command.
So, let's reimplement the above method;
med<-rep(NA,gent)
for (i in 1:gent) {
X <- rnorm(n, 0, 1)
med[i] <- median(X)
}
hist(med, prob=TRUE, ylim = c(0.0,1.25)) # Normalized histogram
f <- function(x) dnorm(x, mean=0, sd=1/sqrt(n))
plot(f,-1,1,add=TRUE)
#The drawn density does not seem to fit quite right, as it appears more compressed (higher mode) than the histogram
hist(med, prob=TRUE, ylim = c(0.0,1.25))
f_2<-function(x) dnorm(x, mean= 0, sd = sd(med))
#With the empirical standard deviation it seems to go a bit better.
plot(f_2,-1.2,1.2,add = T)
qqnorm(med)
#qqline(med)
mean(med)
## [1] 0.01144251
sd(med)
## [1] 0.3652325
abline(0,sd(med))
With mean(med)= 0.0114425 and sd(med)= 0.3652325, it seems reasonable by the QQ-plot to assume that the median is going to be distributed approx. as \(\mathcal{N}(0,0.1333948).\)
Comparing the estimates for the variance of \(\overline{X}\) and \(\widehat{X}\) we see
var(gns)
## [1] 0.1001461
var(med)
## [1] 0.1333948
with the variance of the mean generally being a good bit lower than the variance of the median.
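This matches known theory; as a hedged aside of our own, for \(\mathcal{N}(0,1)\) samples the asymptotic variance of the sample median is \(\pi/(2n)\) against \(1/n\) for the mean:

```r
# Theoretical comparison (asymptotic, so only rough for n = 10):
1/n         # variance of the mean: 0.1
pi/(2*n)    # asymptotic variance of the median: ~0.157
```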
We may do a scatterplot of the mean against the median as follows;
plot(x = gns, y = med, xlim = c(-1.1,1.1), ylim = c(-1.1,1.1))
We may also use ?cor to understand that
cor(gns, med)
## [1] 0.02536134
is to be interpreted as …
For \(X\) being a two dimensional stochastic variable with \[EX=\begin{pmatrix}{1\\0}\end{pmatrix},\,VX= \begin{pmatrix} 4&4\\ 4&16 \end{pmatrix} \] that as \(\begin{pmatrix}{1\\0}\end{pmatrix}=EX \equiv E\begin{pmatrix}{X_1\\X_2}\end{pmatrix} \equiv \begin{pmatrix}{EX_1\\EX_2}\end{pmatrix}\) we get \(EX_1=1,\,EX_2=0.\) In the same way \[ \begin{pmatrix} 4&4\\ 4&16 \end{pmatrix}=VX\equiv\begin{pmatrix} VX_1&cov(X_1,X_2)\\ cov(X_2,X_1)&VX_2 \end{pmatrix}\equiv \begin{pmatrix} VX_1&cov(X_1,X_2)\\ cov(X_1,X_2)&VX_2 \end{pmatrix} \] grants us \(VX_1=4,\,VX_2=16,\,Cov(X_1,X_2)\equiv Cov(X_2,X_1) = 4,\) and thus \[ Corr(X_1,X_2):=\frac{Cov(X_1,X_2)}{\sqrt{VX_1\,VX_2}}\overset{\cdot}{=}\frac{4}{\sqrt{64}}=\frac{4}{8}=\frac{1}{2}. \]
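We may double-check the hand calculation numerically (our own check) with cov2cor(), which converts a covariance matrix to a correlation matrix:

```r
# Correlation matrix of X, computed from the given covariance matrix.
VX <- matrix(c(4, 4, 4, 16), nrow = 2)
cov2cor(VX)  # off-diagonal entries are 0.5, as computed above
```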
For \(Y\equiv\begin{pmatrix}{Y_1\\Y_2}\end{pmatrix}:=\begin{pmatrix}{X_1+2\\X_2-X_1+1}\end{pmatrix},\) we get \[ EY\equiv\begin{pmatrix}{EY_1\\EY_2}\end{pmatrix}\equiv\begin{pmatrix}{E(X_1+2)\\E(X_2-X_1+1)}\end{pmatrix}=\begin{pmatrix}{EX_1+2\\EX_2-EX_1+1}\end{pmatrix}\overset{\cdot}{=}\begin{pmatrix}{1+2\\0-1+1}\end{pmatrix}\equiv\begin{pmatrix}{3\\0}\end{pmatrix}. \]
In calculating the matrix \(VY\) we may use one of two methods;
We also have \[ VY_1\equiv V\mathopen{}\left({X_1+2}\right)\mathclose{}\overset{\text{MI L16.17}}{=}VX_1=4, \] and \[ VY_2\equiv V\left({X_2-X_1+1}\right)\overset{\text{MI L16.17}}{=}V\left({X_2-X_1}\right). \]
Let \(Z,Q\) be random variables defined on a common background space \(\left({\Omega,\mathbb{F}, P}\right).\) If both \(Z,Q\) have second moments, then \[ V\left({Z\pm Q}\right) = VZ + VQ \pm 2Cov(Z,Q) \]
As both \(X_1,X_2\) have second moment, such that \(\left({X_1,X_2}\right)\) by MI E19.8 has covariance, we may thus by the expansion of MI L19.13 above conclude \[ VY_2 = V\left({X_2 - X_1}\right) \overset{\text{L19.13}}{=}VX_2+VX_1-2Cov(X_2,X_1)\underset{\text{MI L19.12}}{\overset{\cdot}{=}} 4+16-2\cdot4=12. \]
Let \(Z,Q,W\) be random variables defined on a common background space \(\left({\Omega,\mathbb{F}, P}\right).\) If both \(\left({Z,Q}\right)\) and \(\left({Z,W}\right)\) have covariance, then \[ Cov(Z,Q\pm W) = Cov(Q\pm W, Z) = Cov(Q,Z)\pm Cov(W,Z) = Cov(Z,Q)\pm Cov(Z,W) \]
Having shown above that \(Y_1\) and \(Y_2\) both have finite variance and thus finite second moments, we may by the MI L19.12 expansion above conclude \[\begin{align} Cov(Y_1,Y_2)&\equiv Cov(X_1+2,X_2-X_1+1)\\ \overset{\text{MI (19.8)}}{=}&Cov(X_1,X_2-X_1)\\ \overset{\text{MI L19.12}}{=}&Cov(X_1,X_2)-Cov(X_1,X_1)\\ \equiv& Cov(X_1,X_2)-VX_1\overset{\cdot}{=}4-4=0 \end{align}\]
As such we altogether get \[ VY\equiv\begin{pmatrix}4&0\\0&12\end{pmatrix}. \]
For \(A:=\begin{pmatrix}1&0\\-1&1\end{pmatrix},\,b:=\begin{pmatrix}{2\\1}\end{pmatrix}\) note that \(Y=AX+b.\) As such by \[\begin{align} VY &\equiv V\left({AX+b}\right)\\ &= V\left({AX}\right)\\ &= AVXA^T\\ &\equiv\begin{pmatrix}1&0\\-1&1\end{pmatrix}\begin{pmatrix}4&4\\4&16\end{pmatrix}\begin{pmatrix}1&-1\\0&1\end{pmatrix}\\ &=\begin{pmatrix}1&0\\-1&1\end{pmatrix}\begin{pmatrix}4&0\\4&12\end{pmatrix}\\ &=\begin{pmatrix}4&0\\0&12\end{pmatrix} \end{align}\]
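The same matrix product is easily verified numerically in R (a check of our own):

```r
# Verify VY = A %*% VX %*% t(A) for the given A and VX.
A <- matrix(c(1, 0, -1, 1), nrow = 2, byrow = TRUE)
VX <- matrix(c(4, 4, 4, 16), nrow = 2)
A %*% VX %*% t(A)  # gives diag(4, 12), matching the derivation
```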
No, we cannot conclude that \(Y_1,Y_2\) are independent just because they have covariance \(0.\) In general we only have \[ Y_1,Y_2\text{ independent} \Rightarrow Cov(Y_1,Y_2)=0, \] and only in the case where \(\left({Y_1,Y_2}\right)\) is jointly normally distributed do we have \[ Y_1,Y_2\text{ independent} \Leftrightarrow Cov(Y_1,Y_2)=0. \]
For \(A:=\begin{pmatrix}1&0\\-1&1\end{pmatrix},\,b:=\begin{pmatrix}{2\\1}\end{pmatrix}\) note that \(Y=AX+b,\) as presented in the matrix method of subproblem 2. By EH K9.46, as \(X\) is now assumed regularly normally distributed, we will thus have \(Y\sim\mathcal{N}\left({{A\xi+b},{AVXA^T}}\right)=\mathcal{N}\left({{EY},{VY}}\right)\overset{\cdot}{=}\mathcal{N}\left({{\begin{pmatrix}{3\\0}\end{pmatrix}},{\begin{pmatrix}4&0\\0&12\end{pmatrix}}}\right).\)
Note that we might define new vectors \[ \tilde{p}_0(x):=p_0(x)\equiv1,\quad \tilde{p}_1(x):=p_1(x)\equiv x,\quad \tilde{p}_2(x):=\frac{p_2(x)+p_0(x)}{3}\overset{\cdot}{=}\frac{3x^2-1+1}{3}=x^2, \] which most definitely span \(\mathscr{P}_2\equiv\left\{{f:\,\left({-1,1}\right)\rightarrow\mathbb{R}\mid\exists a_0,a_1,a_2\in\mathbb{R}:\,f(x)=a_2x^2+a_1x+a_0}\right\}\), and as we have passed from \(\left({p_0,p_1,p_2}\right)\) to \(\left({\tilde{p}_0,\tilde{p}_1,\tilde{p}_2}\right)\) using only linear combinations, we may conclude that \(\left({p_0,p_1,p_2}\right)\) must also be a basis for \(\mathscr{P}_2\).
p0 <- function(x) 1
p1 <- function(x) x
p2 <- function(x) 3*x^2 - 1
p12 <- function(x) p1(x)*p2(x)
integrate(p1,-1,1)
## 0 with absolute error < 1.1e-14
integrate(p2,-1,1)
## -2.775558e-17 with absolute error < 1.7e-14
integrate(p12,-1,1)
## 0 with absolute error < 9.2e-15
Note, as described in the problem, that as \(\left\lVert p_0\right\rVert^2=2,\,\left\lVert p_1\right\rVert^2=\frac{2}{3},\,\left\lVert p_2\right\rVert^2=\frac{8}{5},\) we may define an orthonormal basis for \(\mathscr{P}_2\) as \begin{equation} e_0(x):=\frac{1}{\sqrt{2}},\quad e_1(x):=\sqrt{\frac{3}{2}}x,\quad e_2(x):=\sqrt{\frac{5}{8}}(3x^2-1), \end{equation} implemented in R with
e0 <- function(x) {1/(sqrt(2))}
e1 <- function(x) {sqrt(3/2)*x}
e2 <- function(x) { sqrt(5/8)*(3*x^2-1)}
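We can verify the orthonormality numerically with integrate() (our own check; note that e0 is constant, hence the 0*x term to make the integrand vectorized):

```r
# Squared norms should be ~1, inner products ~0.
integrate(function(x) 0*x + e0(1)^2, -1, 1)  # squared norm of e0
integrate(function(x) e1(x)^2, -1, 1)        # squared norm of e1
integrate(function(x) e2(x)^2, -1, 1)        # squared norm of e2
integrate(function(x) e1(x)*e2(x), -1, 1)    # inner product <e1, e2>
```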
Note now, that by C9.28 in EH, we may for \(Y_0,Y_1,Y_2\sim\mathcal{N}(0,1)\) iid write a regularly normally distributed random variable \(X\) on \(\mathscr{P}_2\) with precision \(\mathopen{}\left\langle \cdot, \cdot \right\rangle\mathclose{}\) and center \(\xi\) as the sum \[X:=\sum_{i=0}^{2}{Y_ie_i}+\xi.\] Implementing this in R we get
xi <- c(0,0,0)
Y_0 <- rnorm(1)
Y_1 <- rnorm(1)
Y_2 <- rnorm(1)
X <- function(x) {Y_0*e0(x)+Y_1*e1(x)+Y_2*e2(x) + xi}
We may also plot the simulated function;
plot(X,-1,1)
## Warning in Y_0 * e0(x) + Y_1 * e1(x) + Y_2 * e2(x) + xi: longer object length is
## not a multiple of shorter object length
We continue from subproblem 3, but before we do, we shorten the construction from it; note that we may construct a function that simulates our desired random function as follows;
f <- function(x) {
temp <- rnorm(3)
X <- temp[1]*e0(x)+temp[2]*e1(x)+temp[3]*e2(x)
return(X)
}
As such, we may do the desired simulation as follows
plot(0,0, type="n", xlim=c(-1,1), ylim=c(-3,3), xlab="x", ylab="f(x)")
for (i in 1:25) {
plot(f,-1,1, add=TRUE) }
For \(X\) regularly normally distributed on \(\mathscr{P}_2\) with center \(0\) and precision \(\left\langle \cdot, \cdot \right\rangle,\) define \(Z:=t(X).\)
Having already completed EH 10.5 in RMD we may copy the assignment of \(X\) and \(A\) with
A <- matrix(c(1,0,1,-1,1,-1,1,1,1,1), nrow = 5, ncol = 2, byrow = T) #byrow = FALSE (column-wise filling) is the default
X <- c(-0.187,-1.731,-0.184,2.252,1.775)
We note that we may reproduce the calculated estimates for \(\alpha_1,\alpha_2,\sigma\) through lm with
model <- lm(X~A-1)
model.matrix(model)
## A1 A2
## 1 1 0
## 2 1 -1
## 3 1 -1
## 4 1 1
## 5 1 1
## attr(,"assign")
## [1] 1 1
summary(model)
##
## Call:
## lm(formula = X ~ A - 1)
##
## Residuals:
## 1 2 3 4 5
## -0.5720 -0.6305 0.9165 0.3815 -0.0955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## A1 0.3850 0.3386 1.137 0.3381
## A2 1.4855 0.3785 3.924 0.0294 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.757 on 3 degrees of freedom
## Multiple R-squared: 0.8477, Adjusted R-squared: 0.7461
## F-statistic: 8.347 on 2 and 3 DF, p-value: 0.05945
Looking through ?model.matrix enlightens us a bit as to the output of model.matrix(model). Note that we find the estimate \(\hat\alpha_1\) for \(\alpha_1\) under [A1, Estimate], the estimate \(\hat\alpha_2\) for \(\alpha_2\) under [A2, Estimate], and the estimate \(\tilde\sigma\) for \(\sigma\) under Residual standard error. We may extract these three estimates with the below commands:
model$coefficients[[1]] #alphahat_1
## [1] 0.385
model$coefficients[[2]] #alphahat_2
## [1] 1.4855
summary(model)$sigma #sigmatilde
## [1] 0.7570445
sigmatilde2 <- (summary(model)$sigma)^2 #sigmatilde^2
## [1] 0.5731163
Having checked ?vcov, note the output of the vcov and solve(t(A)%*%A) commands;
vcov(model)
## A1 A2
## A1 0.1146233 0.0000000
## A2 0.0000000 0.1432791
solve(t(A)%*%A)
## [,1] [,2]
## [1,] 0.2 0.00
## [2,] 0.0 0.25
On an inkling that there “might” be a connection between sigmatilde2, vcov(model) and solve(t(A)%*%A), note that
solve(t(A)%*%A)*sigmatilde2 #Fits rather well with vcov(model)
## [,1] [,2]
## [1,] 0.1146233 0.0000000
## [2,] 0.0000000 0.1432791
Referring back to EH 10.5 and the distribution of the MLE mean \(\hat\alpha\sim\mathcal{N}\mathopen{}\left({{\alpha},{\tilde\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}}}\right)\mathclose{},\) this isn't entirely surprising, noting that ?vcov describes the functionality of vcov as returning the variance-covariance matrix of the main parameters of a fitted model object.
Referring back to HS2, in which the extraction of standard errors for estimates based on a linear regression was first carried out, gives us further clues as to the connection.
sealpha_1 <- sqrt(diag(vcov(model)))[[1]]
## [1] 0.3385606
sealpha_2 <- sqrt(diag(vcov(model)))[[2]]
## [1] 0.3785222
Note then that \[ SE_{\hat\alpha_1}=\sqrt{\underset{\text{vcov(model)[1,1]}}{\underbrace{\tilde\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}_{11}}}}\\ SE_{\hat\alpha_2}=\sqrt{\underset{\text{vcov(model)[2,2]}}{\underbrace{\tilde\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}_{22}}}} \]
Note that we are running seed 314 throughout this RMD document.
We may read the data into and define the requested variables with
paddy <- read.csv("paddy.txt", sep="")
head(paddy)
## days yield
## 1 16 2508
## 2 18 2518
## 3 20 3304
## 4 22 3423
## 5 24 3057
## 6 26 3190
days <- paddy$days
yield <- paddy$yield
daysSqr <- days^2
kvadreg <- lm(yield ~days+daysSqr)
summary(kvadreg)
##
## Call:
## lm(formula = yield ~ days + daysSqr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -303.96 -118.11 13.86 115.67 319.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1070.3977 617.2527 -1.734 0.107
## days 293.4829 42.1776 6.958 9.94e-06 ***
## daysSqr -4.5358 0.6744 -6.726 1.41e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 203.9 on 13 degrees of freedom
## Multiple R-squared: 0.7942, Adjusted R-squared: 0.7625
## F-statistic: 25.08 on 2 and 13 DF, p-value: 3.452e-05
we have made the regression model as written in EH (11.15). Note that we may collect the estimates \(\hat\alpha,\,\hat\beta\) and \(\hat\gamma\) with the code
alphah <- kvadreg$coefficients[[1]]
## [1] -1070.398
betah <- kvadreg$coefficients[[2]]
## [1] 293.4829
gammah <- kvadreg$coefficients[[3]]
## [1] -4.535802
tsigma2 <- (summary(kvadreg)$sigma)^2
## [1] 41568.34
and that these match the estimates reported in EH Example 11.10.
We will be plotting the standardized residuals against the estimated (fitted) values for yield, in the same way we did (e.g.) in HS2;
fit <- fitted(kvadreg)
rst <- rstandard(kvadreg)
plot(fit, rst, main = "(Estimate, Std. Res.)-plot for kvadreg model on paddy", xlab = "Estimated (fitted) yield-values (kg/ha)", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) ) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
abline(0,0)
Looking at the above plot, we note that if the assumption that the variance of the yield data is independent of the mean holds, we would expect the dispersion of points in any region of fitted yield values to be akin to the dispersion in any other region. If the “linearity” assumption of our model holds, we would expect the points to be dispersed roughly as \(\mathcal{N}(0,1)\) on both sides of \(0\) across the fitted values - ie. that the standardized residuals themselves are approximately \(\mathcal{N}(0,1)\) distributed.
The plot looks rather respectable on both these fronts:
1. The (std.) residual values fall within the interval \(\left({-1.5,1.5}\right),\) as would be expected of a standard normal distribution.
2. No weird “trumpet,” “quadratic” or other distortions of the std. residual values are present; they seem to remain rather constantly dispersed around zero within any fitted yield region.
We may run the requested commands in R, simulating \(16\) new datapoints from the normal distribution with mean and variance parameters defined as the empirical parameters of the original data. The implementation;
xi0 <- fitted(kvadreg)
sigma0 <- summary(kvadreg)$sigma
## [1] 203.8831
simYield <- rnorm(16, xi0, sigma0)
We note that the simulated scatterplot retains a lot of the characteristics of the original, though it is definitely different to the original.
As we did with the original data, we will just as well fit a quadratic regression model to the simulated data.
yield2 <- simYield
kvadreg2 <- lm(yield2 ~ days+daysSqr)
alphah2 <- kvadreg2$coefficients[[1]]
## [1] -868.701
betah2 <- kvadreg2$coefficients[[2]]
## [1] 279.8275
gammah2 <- kvadreg2$coefficients[[3]]
## [1] -4.365983
tsigma2_2 <- (summary(kvadreg2)$sigma)^2
## [1] 68692.48
We may also create a new std. residual plot for the simulated data
fit2 <- fitted(kvadreg2)
rst2 <- rstandard(kvadreg2)
plot(fit2, rst2, main = "(Estimate, Std. Res.)-plot for kvadreg2 model on paddy", xlab = "Estimated (fitted) yield-values (kg/ha)", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(rst2))), max(3.2,max(abs(rst2)))) ) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
abline(0,0)
Which once again doesn’t seem to have much of anything out of the ordinary.
We may define a simulation factory function Sim. Note also the functionality of the par function, described under ?par.
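A minimal sketch of such a factory - the name Sim and its body are assumptions here, since the original definition is not reproduced above - could bundle the simulate-fit-plot cycle and rely on par(mfrow = ...) to collect the panels;

```r
# Hypothetical sketch: simulate new yields from the fitted model, refit the
# quadratic regression, and draw the standardized residual plot in one call.
Sim <- function() {
  simYield <- rnorm(16, xi0, sigma0)   # xi0, sigma0 as defined earlier
  m <- lm(simYield ~ days + daysSqr)
  plot(fitted(m), rstandard(m), xlab = "", ylab = "")
  abline(h = 0)
}
par(mfrow = c(3, 3), mar = c(0.55, 0.55, 0.55, 0.55)) # 3x3 grid of panels
for (i in 1:9) Sim()
```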
Conclusions….
The estimates can be obtained with the below code.
coef(kvadreg)
## (Intercept) days daysSqr
## -1070.397689 293.482948 -4.535802
est <- as.numeric(coef(kvadreg))
## [1] -1070.397689 293.482948 -4.535802
Observe that we are dealing with the model \[
Yield_i=\gamma\cdot daysSqr_i+\beta\cdot days_i+\alpha\equiv\gamma\cdot days_i^2+\beta\cdot days_i+\alpha
\] and as such we know from elementary mathematics that the stationary point of \(Yield,\,T_{Yield}\equiv\mathopen{}\left({optDay,optYield}\right)\mathclose{}\) may be found at \[
T_{Yield}\equiv\mathopen{}\left({optDay,optYield}\right)\mathclose{}=\mathopen{}\left({\frac{-\beta}{2\gamma},\frac{-\mathopen{}\left({\beta^2-4\gamma\alpha}\right)\mathclose{}}{4\gamma}}\right)\mathclose{}
\] As such we may estimate the location of the stationary point (the vertex) with \[
\hat T_{Yield}\equiv\mathopen{}\left({\frac{-\hat\beta}{2\hat\gamma},\frac{-\mathopen{}\left({\hat\beta^2-4\hat\gamma\hat\alpha}\right)\mathclose{}}{4\hat\gamma}}\right)\mathclose{}
\] in R with
optDay <- -est[2]/(2*est[3])
## [1] 32.35183
optYield <- - (est[2]^2 - 4*est[1]*est[3]) / (4* est[3])
## [1] 3676.957
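For completeness, the vertex formula used above follows from completing the square: \[
\gamma\cdot days^2+\beta\cdot days+\alpha=\gamma\mathopen{}\left({days+\frac{\beta}{2\gamma}}\right)\mathclose{}^2+\alpha-\frac{\beta^2}{4\gamma},
\] so the stationary point lies at \(days=\frac{-\beta}{2\gamma}\) with value \(\alpha-\frac{\beta^2}{4\gamma}=\frac{-\mathopen{}\left({\beta^2-4\gamma\alpha}\right)\mathclose{}}{4\gamma}.\)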
The time of harvest days cannot be written as an affine transformation of our parameters, as was the premise in Chapters 9 and 10. Specifically, it will not be possible to find a matrix \(Q\) such that \(E\,days_i=Q\begin{pmatrix}\alpha\\\beta\\\gamma\end{pmatrix},\) as isolating days in the model would require the use of squaring and square roots (decidedly non-affine operations) in accordance with the quadratic formula.
We may repeat the tasks of subproblem 6 with the data previously simulated in simYield
estSim <- as.numeric(coef(kvadreg2))
## [1] -868.700973 279.827547 -4.365983
optDaySim <- -estSim[2]/(2*estSim[3])
optYieldSim <- - (estSim[2]^2 - 4*estSim[1]*estSim[3]) / (4*estSim[3])
c(optDaySim,optYieldSim)
## [1] 32.04634 3615.02303
We may repeat the above simulation \(1000\) times with
n <- 10^3
resSim <- matrix(NA, n, 2) # Initiation of a matrix
for (i in 1:n) {
simYield <- rnorm(16, xi0, sigma0)
kvadregSim <- lm(simYield ~ paddy$days + daysSqr)
estSim <- as.numeric(coef(kvadregSim))
optDaySim <- -estSim[2]/2/estSim[3]
optYieldSim <- - (estSim[2]^2 - 4*estSim[1]*estSim[3]) / 4 / estSim[3]
resSim[i,1] <- optDaySim
resSim[i,2] <- optYieldSim
}
We may draw histograms with
par(mfrow=c(2,1))
hist(resSim[,1], main="optDay", breaks = 30)
hist(resSim[,2], main="optYield", breaks = 30)
It doesn’t seem unreasonable to assume that a normal distribution could fit the data rather well;
fDay <- function(x) dnorm(x, mean = mean(resSim[,1]) , sd = sd(resSim[,1])) # the fitted normal density curve
fYield <- function(x) dnorm(x,mean=mean(resSim[,2]), sd = sd(resSim[,2]))
par(mfrow=c(2,1))
hist(resSim[,1], main="optDay", prob=T); plot(fDay, 30,35, add=T, col="blue")
hist(resSim[,2], main="optYield", prob=T); plot(fYield, 3400,3900,add=T, col="hotpink")
One way to get approximate \(95\%\) confidence intervals for each of optDay,optYield is to use the quantile function (see ?quantile);
quantile((resSim[,1]),c(0.025,0.975))
## 2.5% 97.5%
## 31.27306 33.85067
quantile((resSim[,2]),c(0.025,0.975))
## 2.5% 97.5%
## 3536.514 3827.813
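Given the approximately normal histograms above, an alternative sketch is a normal-approximation interval: the simulated mean plus/minus the \(0.975\) normal quantile times the simulated standard deviation;

```r
# Normal-approximation 95% intervals for optDay and optYield; these should
# land close to the quantile-based intervals above.
mean(resSim[, 1]) + c(-1, 1) * qnorm(0.975) * sd(resSim[, 1]) # optDay
mean(resSim[, 2]) + c(-1, 1) * qnorm(0.975) * sd(resSim[, 2]) # optYield
```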
Based on 16 simulated outcomes, we will be fitting a straight-line (“linear, linear-regression”) model, in order to point out the problems with the mean structure that arise in this regard when doing visual model control with the likes of residual plots.
simYield <- rnorm(16, xi0, sigma0)
linRegSim <- lm(simYield ~ paddy$days)
linresSim <- rstandard(linRegSim)
linfitSim <- fitted(linRegSim)
plot(linfitSim, linresSim, main = "(Estimate, Std. Res.)-plot for linear linreg model on simulated data", xlab = "Estimated (fitted) Yield", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(linresSim))), max(3.2,max(abs(linresSim)))) );abline(0,0)
Note the clear parabolic shape of the points in the std.-residual plot above. While the points fall within \((-2,2)\), as they should, the manner in which they do so - via the aforementioned parabolic shape - suggests that the assumed linear relation
\[ EsimYield_i=\zeta+\varphi\cdot days_i \] is rather implausible.
We may, just as in subproblem 5, repeat the process of simulation;
SimNPlot<-function(i) {
simYield <- rnorm(16, xi0, sigma0)
linRegSim <- lm(simYield ~ days)
fit <- fitted(linRegSim)
rst <- rstandard(linRegSim)
plot(fit, rst, xlim=c(3000,3600), main = "", xlab = "", ylab = "", ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) )
#Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
abline(0,0)
}
par(mfrow = c(3,3), mar=c(0.55,0.55,0.55,0.55))
for (i in 1:9) SimNPlot(i)
Looking at the different simulated plots, notice that a fair bit of variety appears, though in each case the parabolic pattern is pronounced enough that we could say the linearity assumption is improbable to hold true.
Below, we will be using the suggested commands to explore the result of tampering with the variance when doing std. residual plots. Note that one may examine ?points for further details regarding this command.
newSD <- 25*(15-abs(paddy$days-31))
newSD
## [1] 0 50 100 150 200 250 300 350 350 300 250 200 150 100 50 0
simYield <- rnorm(16, xi0, newSD)
plot(paddy$days, simYield)
points(paddy$days, xi0, type="l")
Above, we are manipulating the standard deviation: as shown when calling newSD, we assign a symmetrically high deviation to the “middle” of the \(16\) simulated datapoints - ie. the highest deviation to the \(8\)th and \(9\)th draws from the normal distribution, and then progressively and symmetrically lower deviations towards the “edges.” The result, when plotting simYield against days, is that since days are ordered, the i’th Yield simulation (the middle ones of which have now been simulated with greater variance) is paired with the corresponding i’th day.
Consequently for the “quadratic” standardized-residual plot
kvadregSim <- lm(simYield ~ paddy$days + daysSqr)
resSim <- rstandard(kvadregSim)
fitSim <- fitted(kvadregSim)
plot(fitSim, resSim, main = "(Estimate, Std. Res.)-plot", xlab = "Estimated (fitted) Yield", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(resSim))), max(3.2,max(abs(resSim)))) );abline(h=0,lty=2)
Note that the highest fitted Yield values seem to be significantly more spread out than the rest.
We may simulate more datasets to explore the reproducibility of this result
par(mfrow = c(3,3), mar=c(0.55,0.55,0.55,0.55))
for (i in 1:9){
simYield <- rnorm(16, xi0, newSD)
kvadregSim <- lm(simYield ~ paddy$days + daysSqr)
resSim <- rstandard(kvadregSim)
fitSim <- fitted(kvadregSim)
plot(fitSim,resSim, xlim=c(2400,4000), main = "", xlab = "", ylab = "", ylim = c(-max(3.2,max(abs(resSim))), max(3.2,max(abs(resSim))))) ;abline(h=0,lty=2)
}
The simulated datasets seem rather “trumpet shaped,” in that the standardized residuals possess an increased variance toward the higher Yield values. This is in accordance with the procedure by which data was simulated: we have simulated a larger standard deviation for the middle datapoints, which, in accordance with the quadratic regression model, also tend to have larger values - ie. we would expect to see larger variance for the larger values of the dataset. This in turn transfers onto the residuals, which, in keeping with the plots above, are expected to be more spread out towards the larger Yield values.
If one were primarily interested in predicting fat percentages through “the use of the other variables,” one might try a linear regression model of the form \[ EFat_i=\alpha+\beta_1\cdot Triceps_i+\beta_2\cdot Thigh_i+\beta_3\cdot Midarm_i. \]
We may start off with importing the relevant dataset;
bodyfat <- read_table2("bodyfat.txt")
##
## -- Column specification --------------------------------------------------------
## cols(
## Fat = col_double(),
## Triceps = col_double(),
## Thigh = col_double(),
## Midarm = col_double()
## )
head(bodyfat)
## # A tibble: 6 x 4
## Fat Triceps Thigh Midarm
## <dbl> <dbl> <dbl> <dbl>
## 1 11.9 19.5 43.1 29.1
## 2 22.8 24.7 49.8 28.2
## 3 18.7 30.7 51.9 37
## 4 20.1 29.8 54.3 31.1
## 5 12.9 19.1 42.2 30.9
## 6 21.7 25.6 53.9 23.7
as well as using the commands
plot(bodyfat)
cor(bodyfat)
## Fat Triceps Thigh Midarm
## Fat 1.0000000 0.8432654 0.8780896 0.1424440
## Triceps 0.8432654 1.0000000 0.9238425 0.4577772
## Thigh 0.8780896 0.9238425 1.0000000 0.0846675
## Midarm 0.1424440 0.4577772 0.0846675 1.0000000
Looking at ?cor one might conclude that cor generates a correlation matrix for the dataset. The “weird” appearance of the above plot is caused by the fact that we have a total of four variables at play in the dataset, such that there are \(4\cdot 4 = 16\) different ways of plotting one variable against another. Of the \(16\), however, four amount to plotting a variable against itself, which is rather silly, and thus the resulting plot consists of the remaining \(12\) pairwise scatterplots.
We will be fitting the model from subproblem 1 using lm
m1 <- lm(Fat ~ Triceps + Thigh + Midarm, data = bodyfat)
We might also do a visual model control through the use of a standardized residual plot;
rst <- rstandard(m1)
fit <- fitted(m1)
plot(fit, rst, main = "(Estimate, Std. Res.)-plot", xlab = "Estimated (fitted) Fat Percentage", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) );abline(h=0,lty=2)
Which looks rather respectable on all fronts.
We may look at the summary(m1) object;
summary(m1)
##
## Call:
## lm(formula = Fat ~ Triceps + Thigh + Midarm, data = bodyfat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7263 -1.6111 0.3923 1.4656 4.1277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.085 99.782 1.173 0.258
## Triceps 4.334 3.016 1.437 0.170
## Thigh -2.857 2.582 -1.106 0.285
## Midarm -2.186 1.595 -1.370 0.190
##
## Residual standard error: 2.48 on 16 degrees of freedom
## Multiple R-squared: 0.8014, Adjusted R-squared: 0.7641
## F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06
Amongst the coefficient parameters we find the t-statistic values. These are of the form \[
t_{\hat\zeta}=\frac{\hat\zeta-\zeta_0}{se(\hat\zeta)},
\] for some corresponding parameter \(\zeta\) of our model. By default R sets \(\zeta_0=0,\) so the t values that appear result from division of the estimate for the parameter by its standard error, ie. with \(\zeta_0=0\) we have \(t_{\hat\zeta}=\frac{\hat\zeta}{se(\hat\zeta)}.\)
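This identity can be checked directly on the summary table;

```r
# The t value column equals Estimate / Std. Error elementwise (zeta_0 = 0).
cs <- coef(summary(m1))
all.equal(unname(cs[, "Estimate"] / cs[, "Std. Error"]), unname(cs[, "t value"])) # TRUE
```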
With these t values, the corresponding Pr(>|t|)-values are the symmetric tail probabilities (p-values) of getting t values of greater magnitude than \(t_{\hat\zeta}\) in a Student’s t-distribution with m1$df.residual=16 degrees of freedom. That is, they are p-values: the probability of obtaining test results at least as extreme as those actually observed, under the assumption that the null hypothesis is correct, such that small values are critical to the hypothesis. We may formalize the writing of the p-values: for \(T\sim t_{16}\), the Pr(>|t|) value corresponding to \(\zeta\) is \[
P(|T|>|t|)\overset{\text{symmetry}}{=}2P(T>|t|).
\] As an example of a manual calculation of one of the p-values being carried out, we could approach the hypothesis; \[
H:\,\alpha = 0
\] which by the model corresponds to the hypothesis that the intercept is \(0.\)
(degfree <- m1$df.residual)
## [1] 16
hatalpha <- coef(summary(m1))[1,1]
## [1] 117.0847
sealpha <- coef(summary(m1))[1,2]
## [1] 99.7824
t_alpha <- coef(summary(m1))[1,3]
## [1] 1.1734
pt_alpha <- 2*pt(t_alpha, degfree, lower.tail = F)
## [1] 0.2578078
pt_alpha-coef(summary(m1))[1,4] #Is there a difference between our p-value, and summary's? <- nope.
## [1] 0
Looking at the p-value vector
pvalvec <- coef(summary(m1))[,4]
pvalvec
## (Intercept) Triceps Thigh Midarm
## 0.2578078 0.1699111 0.2848944 0.1895628
we might conclude that while there is not a lot of support for the various “marginal” null hypotheses, the p-values are at the same time not so small (critical) that we may immediately dismiss any of them.
After looking through ?anova, we may use the recommended commands
m2 <- lm(Fat ~ Midarm, data = bodyfat)
anova(m2,m1)
## Analysis of Variance Table
##
## Model 1: Fat ~ Midarm
## Model 2: Fat ~ Triceps + Thigh + Midarm
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 18 485.34
## 2 16 98.40 2 386.93 31.456 2.856e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
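To see where the F value in the table comes from, one can recompute it by hand from the residual sums of squares and degrees of freedom reported above;

```r
# F = ((RSS_small - RSS_large)/df_difference) / (RSS_large/df_large),
# using the values printed in the anova table.
rss_small <- 485.34; df_small <- 18 # Fat ~ Midarm
rss_large <- 98.40;  df_large <- 16 # Fat ~ Triceps + Thigh + Midarm
Fval <- ((rss_small - rss_large) / (df_small - df_large)) / (rss_large / df_large)
Fval # roughly 31.46, as in the table
pf(Fval, df_small - df_large, df_large, lower.tail = FALSE) # the matching p-value
```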
Conclusion…
We may import the dataset;
engel <- read_table2("engel.txt") #read_table2 = whitespace delimiter
##
## -- Column specification --------------------------------------------------------
## cols(
## income = col_double(),
## foodexp = col_double()
## )
Having registered both the income and the amount spent on food for different households, one might imagine that the amount spent on food could, within some sort of causal structure, be seen as a function of the income of the corresponding household. We may hold the hypothesis that more household income translates to more disposable income, and that a household with more disposable income would then dispense more of it towards food. We might create the corresponding scatterplot
foodexp <- engel$foodexp #Note that 'attach' only has the scope of the current R chunk
income <- engel$income #Note that 'attach' only has the scope of the current R chunk
p <- ggplot(data = engel)
p_1 <- geom_point(mapping = aes(x=income, y = foodexp))
p+p_1
It seems like there is a central body of observations with some sort of underlying linear trend, though the observations appear more spread out the higher the income within this central body. There also seem to be three outliers from the central body of observations, which are probably going to have a fair bit of sway over a linear fit.
We may fit this model in R with
model <- lm(foodexp ~ income, data = engel)
We might then do a visual model control through a standardized residual plot, and a QQ-plot;
fit <- fitted(model)
rst <- rstandard(model)
library(gridExtra)
p_2 <- qplot(fit, rst, main = "(Estimate, Std. Res.)-plot for linreg model on engel", xlab = "Estimated (fitted) foodexpence", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) )+geom_hline(yintercept = 0) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
p_3 <- p + geom_qq() + geom_qq_line() + aes(sample = rst)
grid.arrange(p_2,p_3, ncol = 2)
From the residual plot we may confirm our suspicions of non-constant variance: the variance seemingly grows with the mean, such that the variance homogeneity assumption seems unreasonable. As there is also a significant number of tail-outliers from the line in the QQ-plot, we might conclude that the normality assumption regarding foodexp is also unreasonable.
With the theory of EH Chapter 11.3 at our back, we may log-transform both variables;
lfood <- log(foodexp)
lincome <- log(income)
Our new statistical model will be a linear regression for the log-transformed variables. One might get the idea that a log-transformation of the “original model” might be prudent, as the log-transform serves to stabilize the variability, especially where the variance increases with the mean.
We will once again fit a linear regression model, noting that the “linear” in linear regression relates to how the parameters appear in the model (thus “linear” should really be interpreted as “affine”). In R;
lmodel <- lm(lincome ~ lfood)
lfit <- fitted(lmodel)
lrst <- rstandard(lmodel)
p_4 <- qplot(lfit, lrst, main = "(Estimate, Std. Res.)-plot for log-linreg model on engel", xlab = "Estimated (fitted) values", ylab ="Standardized residuals", ylim = c(-max(3.2,max(abs(lrst))), max(3.2,max(abs(lrst)))) )+geom_hline(yintercept = 0) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute lrst, largest absolute lrst)
p_5 <- p + geom_qq() + geom_qq_line() + aes(sample = lrst)
grid.arrange(p_4,p_5, ncol = 2)
The log transformation seems to have done some good, as the residual plot seems to hold up to both the linearity assumption for the means and the variance homogeneity assumption, with only a few outliers. The normality assumption also seems like a rather reasonable approximation, once again with only a few outliers.
We may collect the major parameters of the model with
lalpha <- lmodel$coefficients[[1]]
## [1] 0.2249831
lbeta <- lmodel$coefficients[[2]]
## [1] 1.032704
ltsigma2 <- (summary(lmodel)$sigma)^2
## [1] 0.02257692
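As a general side note on interpreting log-log fits: if \(\log Y=\alpha+\beta\cdot\log x+\epsilon,\) then on the original scale \[
Y=e^{\alpha}\cdot x^{\beta}\cdot e^{\epsilon},
\] so the slope \(\beta\) acts as an elasticity - a \(1\%\) change in \(x\) corresponds to roughly a \(\beta\%\) change in \(Y\).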
For the model in subproblem 3
We may start by importing the dataset
pillbug <- read_table2("pillbug.txt",
col_types = cols(group = col_factor(levels = c("Light",
"Moisture", "Control"))))
head(pillbug)
## # A tibble: 6 x 2
## time group
## <dbl> <fct>
## 1 23 Light
## 2 12 Light
## 3 29 Light
## 4 12 Light
## 5 5 Light
## 6 47 Light
By EH page 499, the point of a one-way analysis of variance is to test the model under the assumption that data with the same label come from the same normal distribution, while there might be a difference in mean between the different labels. As such, we may frame the experiment in a variety of ways. The simplest experiment, in which we want to test whether exposure to light or moisture affects time, is to analyse whether any of the treatments have an impact at all, through the constant factor system \(L_1.\) In terms of the one-way analysis of variance, this is the \(L_1\) assumption that all the groups are distributed the same way; but as one of the treatments is ‘no treatment,’ this becomes the hypothesis that none of the treatments are effective, ie. that \(\xi\equiv\mathopen{}\left({\xi_i}\right)\mathclose{}_{i\in I}\in L_1,\) such that \(X_i\sim\mathcal{N}\mathopen{}\left({{\xi},{\sigma^2}}\right)\mathclose{}\ \forall i\in I.\) Note especially that we are dealing with a treatment factor \(t:I\rightarrow T\) with \(T=\mathopen{}\left\{{\text{no treatment},\text{light},\text{moisture}}\right\}\mathclose{}.\) As such we would be looking at the model that is constructed via the following lm-commands, with the design matrix that follows.
lm_L1 <- lm(pillbug$time ~ 1, data = pillbug)
model.matrix(pillbug$time ~ 1, data = pillbug)
## (Intercept)
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
## 11 1
## 12 1
## 13 1
## 14 1
## 15 1
## 16 1
## 17 1
## 18 1
## 19 1
## 20 1
## 21 1
## 22 1
## 23 1
## 24 1
## 25 1
## 26 1
## 27 1
## 28 1
## 29 1
## 30 1
## 31 1
## 32 1
## 33 1
## 34 1
## 35 1
## 36 1
## 37 1
## 38 1
## 39 1
## 40 1
## 41 1
## 42 1
## 43 1
## 44 1
## 45 1
## 46 1
## 47 1
## 48 1
## 49 1
## 50 1
## 51 1
## 52 1
## 53 1
## 54 1
## 55 1
## 56 1
## 57 1
## 58 1
## 59 1
## 60 1
## attr(,"assign")
## [1] 0
Alternatively, we may look at the one-way analysis of variance through the lens of the hypothesis that there really might be a difference in effect between the different treatments. When dealing with the ‘treatment’ factor group we may therefore construct the analysis in R as is done in the following code chunk. Note in particular that the first uncommented line is present in order to change the intercept reference from the alphabetically first factor to Control - one might compare the derived design matrices with and without this relevelling. See also reorder() and factor(). Note also that repeated use of relevel can do any required permuting of the levels, as is seen in the commented line.
#pillbug$group <- relevel(pillbug$group, ref = "Moisture") #relevel M before C, gives order; C, M, L
pillbug$group <- relevel(pillbug$group, ref = "Control")
lm_LT <- lm(time ~ group, data = pillbug)
model.matrix(time ~ group, data = pillbug)
## (Intercept) groupLight groupMoisture
## 1 1 1 0
## 2 1 1 0
## 3 1 1 0
## 4 1 1 0
## 5 1 1 0
## 6 1 1 0
## 7 1 1 0
## 8 1 1 0
## 9 1 1 0
## 10 1 1 0
## 11 1 1 0
## 12 1 1 0
## 13 1 1 0
## 14 1 1 0
## 15 1 1 0
## 16 1 1 0
## 17 1 1 0
## 18 1 1 0
## 19 1 1 0
## 20 1 1 0
## 21 1 0 1
## 22 1 0 1
## 23 1 0 1
## 24 1 0 1
## 25 1 0 1
## 26 1 0 1
## 27 1 0 1
## 28 1 0 1
## 29 1 0 1
## 30 1 0 1
## 31 1 0 1
## 32 1 0 1
## 33 1 0 1
## 34 1 0 1
## 35 1 0 1
## 36 1 0 1
## 37 1 0 1
## 38 1 0 1
## 39 1 0 1
## 40 1 0 1
## 41 1 0 0
## 42 1 0 0
## 43 1 0 0
## 44 1 0 0
## 45 1 0 0
## 46 1 0 0
## 47 1 0 0
## 48 1 0 0
## 49 1 0 0
## 50 1 0 0
## 51 1 0 0
## 52 1 0 0
## 53 1 0 0
## 54 1 0 0
## 55 1 0 0
## 56 1 0 0
## 57 1 0 0
## 58 1 0 0
## 59 1 0 0
## 60 1 0 0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
Alternatively, one might import the data set into R and specify the data in the following manner:
pillbug <- read_table2("pillbug.txt",
col_types = cols(group = col_factor(levels = c("Control", "Light",
"Moisture"))))
As requested, we will be trying the command
boxplot(time~group, data=pillbug)
Having seen the result, we might investigate its origin, and the functionality of the boxplot command, with ?boxplot. The whiskers extend to the most extreme data points within \(1.5\) times the interquartile range of the box; points beyond them are drawn as ‘outliers.’ The main body of the box spans the interquartile range (IQR), from the \(25^{th}\) to the \(75^{th}\) percentile, and the filled line marks the median (the \(50^{th}\) percentile) in each of the groups. We might make a similar plot in ggplot2 with
ggplot(pillbug, aes(x=group, y=time, color=group)) +
geom_boxplot() + theme(legend.position = "none")
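The hinges of the boxplots can be checked numerically - a sketch using quantile per group;

```r
# Per-group quartiles of the running times; these should roughly match the
# box hinges drawn above (boxplot uses fivenum-style hinges, so small
# discrepancies with type-7 quantiles are possible).
tapply(pillbug$time, pillbug$group, quantile, probs = c(0.25, 0.5, 0.75))
```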
We might hereafter use our above constructed lm_LT model, and make the attached residual plot;
fitLT <- fitted(lm_LT)
rstLT <- rstandard(lm_LT)
Stdresplot <- function(model, main = paste("(Estimate, Std. Res.)-plot of", deparse(substitute(model))), ylab ="Standardized residuals", ...) {
fit <- fitted(model)
rst <- rstandard(model)
qplot(fit, rst, main = main, ylab = ylab, ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) )+geom_hline(yintercept = 0) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
}
QQplotdraw <- function(model, main = paste("Normal QQ-plot of", deparse(substitute(model))), xlab = "Theoretical Quantiles", ylab ="Sample Quantiles", ...) {
rst <- rstandard(model)
#dataname <- getCall(lm_LT)$data
ggplot(data = eval(getCall(model)$data), main = main, xlab = xlab, ylab = ylab) + geom_qq() + geom_qq_line() + aes(sample = rst)
} #main, xlab, ylab call do not work for some reason
p1 <- Stdresplot(lm_LT)
p2 <- QQplotdraw(lm_LT)
library(gridExtra)
grid.arrange(p1,p2, ncol = 2)
We see from the standardized residual plot that the linearity assumption of the model seems very plausible, while the cluster with lower times (which turns out to be the Light cases) seems to have significantly lower variance than the other cluster with larger times. We might therefore question the variance homogeneity of the data. Looking at the QQ-plot we see a definite offset from the line with intercept 0 and slope 1 that is to be expected when matching the theoretical \(\mathcal{N}\mathopen{}\left({{0},{1}}\right)\mathclose{}\) quantiles with the sample quantiles of the standardized residuals.
As requested, we may log-transform time in our linear regression model. Doing this in R with
lm_logLT <- lm(I(log(time)) ~ group, data = pillbug)
StdresQQPlot <- function(model,...) {
p1 <- Stdresplot(model,...)
p2 <- QQplotdraw(model,...)
library(gridExtra)
grid.arrange(p1,p2, ncol = 2)
}
StdresQQPlot(lm_logLT)
ggplot(pillbug, aes(x=group, y=log(time), color=group)) +
geom_boxplot() + theme(legend.position = "none")
we see that the QQ-plot is still not entirely convincing, but that the variance homogeneity now seems significantly more appropriate. We thus see that the logarithm has a variance-stabilising effect when applied to time.
The model of the hypothesis that neither moisture nor light causes any change in running time is the model where the mean vector \(\xi\equiv\mathopen{}\left({\xi_i}\right)\mathclose{}_{i\in I}\in L_1,\) ie. that there is no difference in distribution between the data marked with each of the labels, as one of the labels literally is ‘no treatment.’ As such, note the above definition of lm_L1 in R, which we, according to subproblem 3, might log-transform the time of, in order to better fit our model assumption of variance homogeneity. The anova test performed asks whether it makes sense to divide the data amongst the groups, as is done in the \(L_T\) hypothesis of assuming that the distributions within the groups are similar, or whether we might reduce this to there being no difference amongst the groups;
lm_logL1 <- lm(I(log(time))~1, data = pillbug)
(an <- anova(lm_logL1,lm_logLT)) #Note the ordering of the models, such that the smallest model comes first.
## Analysis of Variance Table
##
## Model 1: I(log(time)) ~ 1
## Model 2: I(log(time)) ~ group
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 59 76.677
## 2 57 23.567 2 53.11 64.227 2.498e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Doing the analysis of variance between the model grouped by the individual treatments and \(L_1\) (the second row), note that we have an \(F\)-test with (2, 57) degrees of freedom. The \(2\) refers to the reduction in the number of mean parameters: we go from three different treatment groups (no treatment, light and moisture) to just one common mean (\(L_1\): all the data belong to the same group), a reduction of two.
With an \(F\) value of 64.227, corresponding to a p-value of \(\approx2.5\cdot10^{-15},\) we may with great ease lay to rest the hypothesis that these pillbugs have the same running time no matter the treatment, as this seems exceedingly unlikely given the data.
We may as requested report the parameters of the model in subquestion 3 (with control intercept) as
summary(lm_logLT)
##
## Call:
## lm(formula = I(log(time)) ~ group, data = pillbug)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5454 -0.2037 0.1510 0.4448 0.8602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.99207 0.14378 34.720 < 2e-16 ***
## groupLight -2.00211 0.20334 -9.846 6.61e-14 ***
## groupMoisture -0.01266 0.20334 -0.062 0.951
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.643 on 57 degrees of freedom
## Multiple R-squared: 0.6926, Adjusted R-squared: 0.6819
## F-statistic: 64.23 on 2 and 57 DF, p-value: 2.498e-15
Note especially that the estimate for groupMoisture is roughly zero, such that the data doesn’t change much with moisture. Note also that the estimated change of the light-exposure group when compared to the control is a change in log-time of roughly \(-2\) - something that makes a lot of sense when looking at the boxplots of Control and Light in subproblem 3. Finally, the anova conclusion reached in subquestion 4 is present at the bottom of the summary, in the sense that the F-statistic, along with its degrees of freedom and p-value, appears there.
confint(lm_logLT)
## 2.5 % 97.5 %
## (Intercept) 4.7041572 5.2799872
## groupLight -2.4092865 -1.5949399
## groupMoisture -0.4198323 0.3945143
We see from the above that \(0\) is not contained in the \(95\%\) confidence interval for groupLight, which could thus be said to be significant in this sense. The opposite holds for groupMoisture, for which zero is contained in the interval.
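These intervals can also be reproduced by hand as the estimate plus/minus the \(0.975\) quantile of the \(t_{57}\) distribution times the standard error;

```r
# Manual 95% confidence interval for the intercept of lm_logLT; this should
# match the (Intercept) row of confint(lm_logLT) above.
cs <- coef(summary(lm_logLT))
tq <- qt(0.975, df = lm_logLT$df.residual) # t quantile with 57 df
cs["(Intercept)", "Estimate"] + c(-1, 1) * tq * cs["(Intercept)", "Std. Error"]
```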
This document goes through the analysis from EH Example 11.13 in R. The document shows some R-technical solutions; the comments on the analysis itself are very sparse, and should not be seen as exhaustive of what a practical data analysis should look like.
We will use various functions from the tidyverse, so that package is loaded first.
library(tidyverse)
We read in the data from EH Example 11.13.
CAPM <- read_csv("CAPM.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Month = col_character(),
## r = col_double(),
## M = col_double(),
## R = col_double()
## )
Scatter plots of the three variables against each other.
qplot(r, R, data = CAPM, size = I(3), color = I("blue"))
qplot(M, R, data = CAPM, size = I(3), color = I("blue"))
qplot(r, M, data = CAPM, size = I(3), color = I("blue"))
We can also plot the three variables over time, which makes good sense for this dataset.
qplot(factor(Month, levels = Month), R, data = CAPM, size = I(2), color = I("blue")) + geom_line(aes(group = 1), color = "blue")
qplot(factor(Month, levels = Month), M, data = CAPM, size = I(2), color = I("red")) + geom_line(aes(group = 1), color = "red")
qplot(factor(Month, levels = Month), r, data = CAPM, size = I(2), color = I("violet")) + geom_line(aes(group = 1), color = "violet")
We now recover the fitted model from EH Example 11.13.
CAPM_lm <- lm(R ~ M + r, data = CAPM)
summary(CAPM_lm)
##
## Call:
## lm(formula = R ~ M + r, data = CAPM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7829 -2.6282 -0.4654 3.4771 11.5701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.0837 6.1645 -1.636 0.1203
## M 0.1556 0.3371 0.462 0.6502
## r 29.0208 14.4096 2.014 0.0601 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.394 on 17 degrees of freedom
## Multiple R-squared: 0.2035, Adjusted R-squared: 0.1098
## F-statistic: 2.171 on 2 and 17 DF, p-value: 0.1446
We can construct the design matrix “by hand” from the data as follows:
A <- matrix(c(rep(1, 20), CAPM$M, CAPM$r), nrow = 20)
A
## [,1] [,2] [,3]
## [1,] 1 8.549 0.496
## [2,] 1 -0.173 0.510
## [3,] 1 4.199 0.532
## [4,] 1 -0.630 0.569
## [5,] 1 -0.565 0.543
## [6,] 1 5.106 0.581
## [7,] 1 7.012 0.543
## [8,] 1 -0.346 0.475
## [9,] 1 6.312 0.370
## [10,] 1 2.949 0.316
## [11,] 1 -1.556 0.340
## [12,] 1 2.432 0.316
## [13,] 1 8.631 0.309
## [14,] 1 0.123 0.315
## [15,] 1 10.174 0.314
## [16,] 1 -4.737 0.350
## [17,] 1 2.498 0.334
## [18,] 1 -5.460 0.346
## [19,] 1 0.921 0.351
## [20,] 1 5.283 0.371
We can now recover the estimates by solving a linear system of equations.
solve(t(A) %*% A, t(A) %*% CAPM$R)
## [,1]
## [1,] -10.0837178
## [2,] 0.1555781
## [3,] 29.0207852
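As a small self-contained sketch (on made-up toy data, not the CAPM data), the normal-equation solve above reproduces the coefficients from lm() exactly:

```r
# Toy illustration: solving the normal equations t(A) A theta = t(A) y
# gives the same coefficients as lm(). The data below are made up.
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 8.1)
A <- cbind(1, x)                        # design matrix with intercept column
theta <- solve(t(A) %*% A, t(A) %*% y)  # normal-equation solution
stopifnot(all.equal(as.numeric(theta), unname(coef(lm(y ~ x)))))
```

In practice lm() works via a QR decomposition, which is numerically more stable than forming \(A^TA\) explicitly.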
We can also extract the design matrix from the lm object, or construct it directly with the function model.matrix()
model.matrix(CAPM_lm)
## (Intercept) M r
## 1 1 8.549 0.496
## 2 1 -0.173 0.510
## 3 1 4.199 0.532
## 4 1 -0.630 0.569
## 5 1 -0.565 0.543
## 6 1 5.106 0.581
## 7 1 7.012 0.543
## 8 1 -0.346 0.475
## 9 1 6.312 0.370
## 10 1 2.949 0.316
## 11 1 -1.556 0.340
## 12 1 2.432 0.316
## 13 1 8.631 0.309
## 14 1 0.123 0.315
## 15 1 10.174 0.314
## 16 1 -4.737 0.350
## 17 1 2.498 0.334
## 18 1 -5.460 0.346
## 19 1 0.921 0.351
## 20 1 5.283 0.371
## attr(,"assign")
## [1] 0 1 2
model.matrix(~ M + r, data = CAPM)
## (Intercept) M r
## 1 1 8.549 0.496
## 2 1 -0.173 0.510
## 3 1 4.199 0.532
## 4 1 -0.630 0.569
## 5 1 -0.565 0.543
## 6 1 5.106 0.581
## 7 1 7.012 0.543
## 8 1 -0.346 0.475
## 9 1 6.312 0.370
## 10 1 2.949 0.316
## 11 1 -1.556 0.340
## 12 1 2.432 0.316
## 13 1 8.631 0.309
## 14 1 0.123 0.315
## 15 1 10.174 0.314
## 16 1 -4.737 0.350
## 17 1 2.498 0.334
## 18 1 -5.460 0.346
## 19 1 0.921 0.351
## 20 1 5.283 0.371
## attr(,"assign")
## [1] 0 1 2
First we plot the raw residuals
qplot(fitted(CAPM_lm), residuals(CAPM_lm)) +
geom_hline(yintercept = 0) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
and then the residuals standardized by the central estimate of the standard deviation.
sigma_hat <- sqrt(sum(residuals(CAPM_lm)^2) / 17) # central estimate of the standard deviation
qplot(fitted(CAPM_lm), residuals(CAPM_lm) / sigma_hat) +
geom_hline(yintercept = 0) +
geom_hline(yintercept = -2, linetype = 2) +
geom_hline(yintercept = 2, linetype = 2) +
geom_smooth() +
ylim(-3, 3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Finally we plot the standardized residuals (red below), where we have additionally divided by \(\sqrt{1 - h_{ii}}\), alongside the residuals from above.
qplot(fitted(CAPM_lm), rstandard(CAPM_lm), color = I("red")) +
geom_point(aes(y = residuals(CAPM_lm) / sigma_hat)) +
geom_hline(yintercept = 0) +
geom_hline(yintercept = -2, linetype = 2) +
geom_hline(yintercept = 2, linetype = 2) +
geom_smooth() +
ylim(-3, 3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
One can observe that the effect of dividing by \(\sqrt{1 - h_{ii}}\) is quite small, but all observations move outward. We can extract \(h_{ii}\) from the lm object with hatvalues() if we wish.
range(1 / sqrt(1 - hatvalues(CAPM_lm)))
## [1] 1.040627 1.170057
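As a hedged aside (again on toy data, not the CAPM data), hatvalues() is nothing but the diagonal of the projection matrix \(A\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}A^T\):

```r
# Toy illustration: hatvalues() equals diag(A (A^T A)^{-1} A^T).
x <- c(1, 2, 3, 5, 8)
y <- c(1.2, 1.9, 3.1, 5.2, 8.3)
fit <- lm(y ~ x)
A <- model.matrix(fit)                  # 5 x 2 design matrix
H <- A %*% solve(t(A) %*% A) %*% t(A)   # hat matrix
stopifnot(all.equal(as.numeric(hatvalues(fit)), as.numeric(diag(H))))
```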
For this example it is also worth looking at the residuals as a function of time, and making a lag plot to check for dependence.
qplot(1:20, rstandard(CAPM_lm)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
qplot(rstandard(CAPM_lm)[-20], rstandard(CAPM_lm)[-1]) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
The lag plot indicates that the errors are negatively correlated.
We can also make a QQ-plot to examine the normality assumption. With only 20 observations it is quite hard to reject normality.
qqnorm(rstandard(CAPM_lm))
qqline(rstandard(CAPM_lm))
Confidence intervals
Finally, we can recover the confidence intervals from EH by using confint() on the lm object. It converts estimates and standard errors into standard confidence intervals based on the appropriate \(t\)-distribution.
confint(CAPM_lm)
## 2.5 % 97.5 %
## (Intercept) -23.0895931 2.9221576
## M -0.5555732 0.8667294
## r -1.3808162 59.4223867
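As a sketch of what confint() does under the hood (using the rounded estimate and standard error for M read off the summary above), the M row can be reproduced by hand:

```r
# Hand computation of the confint() row for M. est and se are the (rounded)
# values from the summary output above; df = 17 is the residual df.
est <- 0.1556
se  <- 0.3371
tq  <- qt(0.975, df = 17)
c(lower = est - tq * se, upper = est + tq * se)  # close to (-0.5556, 0.8667)
```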
If we reduce the model by removing the intercept, we obtain a rather different estimate and a reduction in the standard error of \(\hat{\gamma}\). This is mainly because \(r\) varies very little, so the intercept column and the \(r\) column end up close to each other. Whether they are outright close to being collinear is up for discussion.
CAPM_lm_0 <- lm(R ~ M + r - 1, data = CAPM)
summary(CAPM_lm_0)
##
## Call:
## lm(formula = R ~ M + r - 1, data = CAPM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1457 -3.1795 -0.7022 3.3169 9.2088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## M 0.09853 0.35052 0.281 0.782
## r 6.32644 4.07043 1.554 0.138
##
## Residual standard error: 6.685 on 18 degrees of freedom
## Multiple R-squared: 0.1799, Adjusted R-squared: 0.08875
## F-statistic: 1.974 on 2 and 18 DF, p-value: 0.1678
confint(CAPM_lm_0)
## 2.5 % 97.5 %
## M -0.6378792 0.8349428
## r -2.2252161 14.8780969
CAPM <- read_csv("CAPM.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Month = col_character(),
## r = col_double(),
## M = col_double(),
## R = col_double()
## )
Our main focus is the CAPM-inspired regression model from EH example 11.13, \[
R_i\equiv \alpha + \beta M_i + \gamma r_i + \epsilon_i,
\] for \(\epsilon_1,\ldots,\epsilon_N\) iid. \(\sim\mathcal{N}\mathopen{}\left({{0},{\sigma^2}}\right)\mathclose{}\) for some \(\sigma>0,\) such that \(R_1,\ldots,R_N\) are assumed independent normally distributed with variance homogeneity of \(\sigma^2,\) and mean \[
ER_i=\alpha+\beta M_i+\gamma r_i + 0.
\] Define \(\theta\equiv\mathopen{}\left({\alpha, \beta, \gamma}\right)\mathclose{}^T\) such that we may find a design matrix \(A\) such that \[
R= A\theta+\epsilon\equiv A\begin{pmatrix}{\alpha\\\beta\\\gamma}\end{pmatrix}+\begin{pmatrix}{\epsilon_1\\\,\vdots\\\epsilon_N}\end{pmatrix}\equiv A_{,1}\alpha + A_{,2}\beta + A_{,3}\gamma + \begin{pmatrix}{\epsilon_1\\\,\vdots\\\epsilon_N}\end{pmatrix}
\] Note, as \(\epsilon\in M_{N,1},\theta\in M_{3,1},\) that we thus need to have \(A\in M_{N,3}.\) Further, we will necessarily need to have \[
a_{j,1}=1,\quad a_{j,2}=M_j,\quad a_{j,3}=r_j,\quad \text{for } j=1,\ldots,N.
\] Consequently, our central assumption of \(R_1,\ldots,R_N\) independent normally distributed with variance homogeneity may thus for \(R:=\mathopen{}\left({R_1,\ldots,R_N}\right)\mathclose{}^T\) be written as \[
R\sim\mathcal{N}\mathopen{}\left({{\begin{pmatrix}{ER_1\\ \,\,\vdots\\ ER_N}\end{pmatrix}},{\sigma^2I}}\right)\mathclose{}\equiv\mathcal{N}\mathopen{}\left({{\begin{pmatrix}{\alpha+\beta M_1+\gamma r_1\\ \quad \quad \quad \vdots\\\alpha+\beta M_N+\gamma r_N}\end{pmatrix}},{\sigma^2I}}\right)\mathclose{}\equiv\mathcal{N}\mathopen{}\left({{A\theta},{\sigma^2I}}\right)\mathclose{}.
\] We may define the design matrix in R as
N <- nrow(CAPM)
k <- 3
R <- CAPM$R
A <- matrix(c(rep(1,20),CAPM$M,CAPM$r), nrow = N, ncol = k) # filled column by column, the default
A
## [,1] [,2] [,3]
## [1,] 1 8.549 0.496
## [2,] 1 -0.173 0.510
## [3,] 1 4.199 0.532
## [4,] 1 -0.630 0.569
## [5,] 1 -0.565 0.543
## [6,] 1 5.106 0.581
## [7,] 1 7.012 0.543
## [8,] 1 -0.346 0.475
## [9,] 1 6.312 0.370
## [10,] 1 2.949 0.316
## [11,] 1 -1.556 0.340
## [12,] 1 2.432 0.316
## [13,] 1 8.631 0.309
## [14,] 1 0.123 0.315
## [15,] 1 10.174 0.314
## [16,] 1 -4.737 0.350
## [17,] 1 2.498 0.334
## [18,] 1 -5.460 0.346
## [19,] 1 0.921 0.351
## [20,] 1 5.283 0.371
As \(A\) is very much defined via its columns, we may also use the cbind function to construct the matrix in R;
A <- cbind(rep(1,20),CAPM$M, CAPM$r)
We may then use the results of K10.21 of EH to conclude that the MLE of \(\theta\) may be found in R as
MLETheta <- solve(t(A)%*%A)%*%t(A)%*%CAPM$R
Also by EH K10.21 we have \(\hat\theta\sim\mathcal{N}\mathopen{}\left({{\theta},{\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}}}\right)\mathclose{},\) the covariance matrix of which we may estimate in R through
sigmat2 <- as.numeric((t(R)%*%R-t(MLETheta)%*%t(A)%*%A%*%MLETheta)/(N-k))
## [1] 40.88448
thetaCovMat <- sigmat2*solve(t(A)%*%A)
## [,1] [,2] [,3]
## [1,] 38.0005763 -0.2149795 -85.5238319
## [2,] -0.2149795 0.1136148 -0.1766915
## [3,] -85.5238319 -0.1766915 207.6366120
Note that defining \(\tilde\sigma^2=\)sigmat2 with as.numeric is rather important, as it avoids non-conformability errors in the subsequent multiplication defining thetaCovMat.
With the above calculated distribution of \(\hat\theta\equiv\mathopen{}\left({\hat\alpha,\,\hat\beta,\,\hat\gamma}\right)\mathclose{}^T\) we may by EH L9.47 conclude that \[
\hat\beta\sim\mathcal{N}\mathopen{}\left({{\beta},{\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}_{2,2}}}\right)\mathclose{}\overset{\cdot}{=}\mathcal{N}\mathopen{}\left({{\beta},{0.1136148}}\right)\mathclose{},\\
\hat\gamma\sim\mathcal{N}\mathopen{}\left({{\gamma},{\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}_{3,3}}}\right)\mathclose{}\overset{\cdot}{=}\mathcal{N}\mathopen{}\left({{\gamma},{207.636612}}\right)\mathclose{}.
\] But more importantly \(\hat\beta+\hat\gamma=\mathopen{}\left({0,1,1}\right)\mathclose{}\hat\theta,\) such that we by K9.46 of EH may conclude that \[
\hat\beta+\hat\gamma=\mathopen{}\left({0,1,1}\right)\mathclose{}\hat\theta\sim\mathcal{N}\mathopen{}\left({{\mathopen{}\left({0,1,1}\right)\mathclose{}\theta},{\,\mathopen{}\left({0,1,1}\right)\mathclose{}\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}\begin{pmatrix}{0\\1\\1}\end{pmatrix}}}\right)\mathclose{}\\
=\mathcal{N}\mathopen{}\left({{\beta+\gamma},{\,\mathopen{}\left({0,1,1}\right)\mathclose{}\sigma^2\mathopen{}\left({A^TA}\right)\mathclose{}^{-1}\begin{pmatrix}{0\\1\\1}\end{pmatrix}}}\right)\mathclose{}
\] with the above variance calculated in R as
selector <- matrix(c(0,1,1),nrow=1)
bgvar <- as.numeric(selector%*%thetaCovMat%*%t(selector))
## [1] 207.3968
As such, if we wanted to write the distribution of \(\hat\beta+\hat\gamma\) on the form \(\hat\beta+\hat\gamma\sim\mathcal{N}\mathopen{}\left({{\beta+\gamma},{se^2}}\right)\mathclose{}\) we should choose \(\tilde{se}^2=\)bgvar\(=207.3968439,\) such that \(\tilde{se}=14.4012792.\)
Having calculated the central variance estimate \(\tilde\sigma^2=40.8844839,\) and writing \(se^2=\sigma^2\cdot a\approx\tilde\sigma^2\cdot a,\) we may find \(a\) as \[ a=\frac{se^2}{\sigma^2}\underset{\sim}{\overset{\cdot}{=}}\frac{207.3968439}{40.8844839}=5.0727519 \]
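Though not asked for in the exercise, the quantities above can be combined into a \(95\%\) confidence interval for \(\beta+\gamma\); a sketch using the numbers stated in the text:

```r
# Sketch: 95% CI for beta + gamma, using the estimates quoted above.
est <- 0.1555781 + 29.0207852    # beta-hat + gamma-hat
se  <- sqrt(207.3968439)         # = se-tilde = 14.4012792
tq  <- qt(0.975, df = 17)        # N - k = 20 - 3 = 17 residual df
c(lower = est - tq * se, upper = est + tq * se)
```

The interval contains zero, consistent with neither coefficient being individually significant in the summary of the model.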
Let \(X_1,X_2,X_3,X_4,X_5\) be independent normally distributed, each with the same variance \(\sigma^2>0\), and individual expectations given by \[ EX_1=\alpha_1,\quad EX_2=\alpha_1-\alpha_2,\quad EX_3 = \alpha_1-\alpha_2,\quad EX_4 = \alpha_1+\alpha_2,\quad EX_5 = \alpha_1+\alpha_2, \] where \(\alpha_1,\alpha_2\in\mathbb{R}\) are unknown parameters.
Assume we have observed outcomes of the \(X\)’s such that \(X_i=x_i\) with \(x_i\) as assigned below
X <- c(-0.187,-1.731,-0.184,2.252,1.775)
Note that \(X\equiv\left({X_1,\ldots,X_5}\right)^T\) is a bundling of five random variables, perceived as a column vector as such; as \(\alpha\equiv\left({\alpha_1,\alpha_2}\right)^T\in M_{2,1}(\mathbb{R}),\) we will, for \(A\alpha\) being our expected value vector \((\in M_{5,1}(\mathbb{R})),\) need \(A\in M_{5,2}(\mathbb{R}).\)
Consequently, observe \[ A\alpha \equiv \begin{pmatrix} a_{1,1} & a_{1,2}\\ a_{2,1} & a_{2,2}\\ a_{3,1} & a_{3,2}\\ a_{4,1} & a_{4,2}\\ a_{5,1} & a_{5,2} \end{pmatrix}\begin{pmatrix}{\alpha_1\\\alpha_2}\end{pmatrix}\equiv\begin{pmatrix}{a_{1,1}\\a_{2,1}\\a_{3,1}\\a_{4,1}\\a_{5,1}}\end{pmatrix}\alpha_1+\begin{pmatrix}{a_{1,2}\\a_{2,2}\\a_{3,2}\\a_{4,2}\\a_{5,2}}\end{pmatrix}\alpha_2, \] such that, as we are seeking \(\begin{pmatrix}{\alpha_1\\\alpha_1-\alpha_2\\\alpha_1-\alpha_2\\\alpha_1+\alpha_2\\\alpha_1+\alpha_2}\end{pmatrix}\equiv EX=A\alpha,\) we may read off the coefficient columns of \(\alpha_1\) and \(\alpha_2\) as \(a_{i,1}=1,\,i=1,\ldots,5,\) and \(a_{1,2}=0,\,a_{2,2}=-1,\,a_{3,2}=-1,\,a_{4,2}=1,\,a_{5,2}=1,\) such that \[ A:=\begin{pmatrix} 1&0\\ 1&-1\\ 1&-1\\ 1&1\\ 1&1 \end{pmatrix} \] will be our designated design matrix.
In accordance with K10.21 in EH we may define our design matrix in R as
A <- matrix(c(1,0,1,-1,1,-1,1,1,1,1), nrow = 5, ncol = 2, byrow = T) # byrow = T, as filling by column is the default
By EH K10.21 we will have \(\hat\alpha = \left({A^TA}\right)^{-1}A^TX\) and as such we may define in R
puA <- solve(t(A)%*%A)%*%t(A)
alphahat <- puA%*%X
## [,1]
## [1,] 0.3850
## [2,] 1.4855
and as such we may therefore conclude that the MLE for \(\alpha\), \(\hat\alpha\equiv\begin{pmatrix}{\hat\alpha_1\\\hat\alpha_2}\end{pmatrix}=\begin{pmatrix}{0.385\\1.4855}\end{pmatrix}.\)
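As a self-contained cross-check (an addition, not part of the exercise text), the hand-rolled MLE agrees with lm() fitted without an intercept on the same design matrix:

```r
# Cross-check: (A^T A)^{-1} A^T X reproduces coef(lm(X ~ A - 1)).
X <- c(-0.187, -1.731, -0.184, 2.252, 1.775)
A <- matrix(c(1,0, 1,-1, 1,-1, 1,1, 1,1), nrow = 5, ncol = 2, byrow = TRUE)
alphahat <- solve(t(A) %*% A) %*% t(A) %*% X
stopifnot(all.equal(as.numeric(alphahat), c(0.385, 1.4855)))
stopifnot(all.equal(as.numeric(alphahat), unname(coef(lm(X ~ A - 1)))))
```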
By pages 413 and 415 in EH, for \(A\) being an \(N\times k\overset{\cdot}{=}5\times 2\) matrix, we may estimate \(\tilde\sigma^2\) on the basis of data as \[
\tilde{\sigma}^2=\frac{X^TX-\hat{\alpha}^TA^TA\hat{\alpha}}{N-k}\overset{\cdot}{=}\frac{X^TX-\hat{\alpha}^TA^TA\hat{\alpha}}{3}
\] which we may calculate in R with
sigmatilde2 <- as.numeric((t(X)%*%X-t(alphahat)%*%t(A)%*%A%*%alphahat)/3) # as.numeric drops the 1x1 matrix class for later scalar arithmetic
## [1] 0.5731163
By EH K10.21 the MLE for the expected value, \(\hat\alpha,\) and the MLE for the variance, \(\hat\sigma^2,\) are independent; as the central estimator for the variance, \(\tilde{\sigma}^2:=N\frac{\hat\sigma^2}{N-k},\) is a measurable transformation of \(\hat\sigma^2,\) we also have independence of \(\hat\alpha\) and \(\tilde\sigma^2.\)
By independence, the joint density of \(\hat\alpha\) and \(\tilde\sigma^2\) can be obtained through the product of the marginal densities. Note in this regard that by K10.21
\[ \hat\alpha\sim\mathcal{N}(\alpha,\sigma^2\left({A^TA}\right)^{-1}) \] and \[ \hat\sigma^2\sim\frac{\sigma^2}{N}\chi^2_{N-k} \] as in, \(\hat\sigma^2\) is \(\chi^2\) distributed with \(N-k\) degrees of freedom and “scale parameter” \(\frac{\sigma^2}{N},\) which is to be interpreted in the sense that \(\frac{1}{\frac{\sigma^2}{N}}\hat\sigma^2\equiv\frac{N}{\sigma^2}\hat\sigma^2\sim\chi^2_{N-k}\) is “properly” \(\chi^2\) distributed with \(N-k\) degrees of freedom. Consequently, as \(\tilde{\sigma}^2:=N\frac{\hat\sigma^2}{N-k},\) we will have \[ \tilde\sigma^2\sim N\frac{\frac{\sigma^2}{N}}{N-k}\chi^2_{N-k}\equiv\frac{\sigma^2}{N-k}\chi^2_{N-k}. \] The density of \(Q\sim\chi_q^2\) is \(f_Q(x)=\frac{1}{2^{\frac{q}{2}}\Gamma(\frac{q}{2})}x^{\frac{q}{2}-1}e^{-\frac{x}{2}},\) and for a scaled variable \(cQ,\,c>0,\) the density is \(x\mapsto\frac{1}{c}f_Q\mathopen{}\left({\frac{x}{c}}\right)\mathclose{},\) such that with \(c=\frac{\sigma^2}{N-k}\) the density of \(\tilde\sigma^2\) becomes \[\begin{align*} f_{\tilde\sigma^2}(x)&=\frac{N-k}{\sigma^2}\frac{1}{2^{\frac{N-k}{2}}\Gamma(\frac{N-k}{2})}\mathopen{}\left({\frac{(N-k)x}{\sigma^2}}\right)\mathclose{}^{\frac{N-k}{2}-1}e^{-\frac{(N-k)x}{2\sigma^2}}\\ &\overset{\cdot}{=}\frac{3}{\sigma^2}\frac{1}{2^{\frac{3}{2}}\Gamma(\frac{3}{2})}\mathopen{}\left({\frac{3x}{\sigma^2}}\right)\mathclose{}^{\frac{1}{2}}e^{-\frac{3x}{2\sigma^2}}, \end{align*}\] hence, for \(f_{\hat\alpha}\) being the density of \(\hat\alpha,\) the joint density of \(\hat\alpha\) and \(\tilde\sigma^2,\) \(f_{\hat\alpha,\tilde\sigma^2},\) will be \[ f_{\hat\alpha,\tilde\sigma^2}(x,y)=f_{\hat\alpha}(x)\cdot f_{\tilde\sigma^2}(y). \]
We may deploy one of two methods, in order to calculate confidence intervals for \(\alpha_1,\alpha_2.\)
Note that we by a) may write \[ X\equiv\left({X_1,\ldots,X_5}\right)^T\sim\mathcal{N}\left({{A\alpha},{\sigma^2I}}\right). \] Define \(\varphi_1:\mathbb{R}^2\rightarrow\mathbb{R},\,\varphi_2:\mathbb{R}^2\rightarrow\mathbb{R},\) for \(\omega\equiv\begin{pmatrix}{\omega_1\\\omega_2}\end{pmatrix}\in\mathbb{R}^2\) as the coordinate projections \(\varphi_1(\omega):=\omega_1,\,\varphi_2(\omega):=\omega_2\) respectively. For \(e_1:=\begin{pmatrix}{1\\0}\end{pmatrix},\,e_2:=\begin{pmatrix}{0\\1}\end{pmatrix},\) observe \[ \varphi_1(\omega)=e_1^T\omega=\omega_1,\quad\varphi_2(\omega)=e_2^T\omega=\omega_2. \] \(1-\alpha^*\) confidence intervals for \(\varphi_1(\alpha)=\alpha_1,\,\varphi_2(\alpha)=\alpha_2\) may be obtained as \[\begin{align} KI_{\alpha_1}&=\left({e_1^T\hat\alpha-\sqrt{e_1^T\left({A^TA}\right)^{-1}e_1f_{\alpha^*}\tilde\sigma^2},\,e_1^T\hat\alpha+\sqrt{e_1^T\left({A^TA}\right)^{-1}e_1f_{\alpha^*}\tilde\sigma^2}}\right)\\ KI_{\alpha_2}&=\left({e_2^T\hat\alpha-\sqrt{e_2^T\left({A^TA}\right)^{-1}e_2f_{\alpha^*}\tilde\sigma^2},\,e_2^T\hat\alpha+\sqrt{e_2^T\left({A^TA}\right)^{-1}e_2f_{\alpha^*}\tilde\sigma^2}}\right) \end{align}\] by EH Example 10.30, with \(f_{\alpha^*}\) being the \(1-\alpha^*\) quantile for the \(F\)-distribution with \(\left({1,N-k}\right)\) degrees of freedom, for \(A\in M_{N,k}\overset{\cdot}{=}M_{5,2}\) of rank \(k\overset{\cdot}{=}2\).
Having looked through ?qf, an implementation in R for \(\alpha^*=0.05\) may be obtained with
falphastar <- qf(p = 0.95, df1 = 1, df2 = nrow(A)-ncol(A)) #f_{\alpha^*}
e_1 <- c(1,0)
e_2 <- c(0,1)
sqhelper_1 <- as.numeric(sqrt(t(e_1)%*%solve(t(A)%*%A)%*%e_1*falphastar*sigmatilde2))
## [1] 1.077451
sqhelper_2 <- as.numeric(sqrt(t(e_2)%*%solve(t(A)%*%A)%*%e_2*falphastar*sigmatilde2))
## [1] 1.204627
KI_alpha1_1 <- c(e_1%*%alphahat-sqhelper_1, e_1%*%alphahat+sqhelper_1)
## [1] -0.6924509 1.4624509
KI_alpha2_1 <- c(e_2%*%alphahat-sqhelper_2, e_2%*%alphahat+sqhelper_2)
## [1] 0.2808733 2.6901267
Other methods to define the sqhelper’s
Note that we could’ve also defined the sqhelper’s with
sqrt(solve(t(A)%*%A)[1,1]*falphastar*sigmatilde2)
## [1] 1.077451
sqrt(solve(t(A)%*%A)[2,2]*falphastar*sigmatilde2)
## [1] 1.204627
as \(e_i^TBe_i=b_{ii}\) for any standard basis vector \(e_i\) of a size appropriate for multiplication with the matrix \(B.\) Finally, we might also choose to forego selecting the diagonal element until it is needed;
sqhelper <- sqrt(solve(t(A)%*%A)*falphastar*sigmatilde2)
sqhelper[1,1]
## [1] 1.077451
sqhelper[2,2]
## [1] 1.204627
As \[ \hat\alpha\sim\mathcal{N}(\alpha,\sigma^2\left({A^TA}\right)^{-1})\overset{\cdot}{=}\mathcal{N}\left({{\begin{pmatrix}{\alpha_1\\\alpha_2}\end{pmatrix}},{\begin{pmatrix} \frac{\sigma^2}{5}&0\\ 0&\frac{\sigma^2}{4}\end{pmatrix}}}\right) \] we get by EH L9.47 that \[ \hat\alpha_1\sim\mathcal{N}\left({{\alpha_1},{\frac{\sigma^2}{5}}}\right),\quad\hat\alpha_2\sim\mathcal{N}\left({{\alpha_2},{\frac{\sigma^2}{4}}}\right). \] By the likes of “Introduktion til Statistik” T4.7 we may note that \[ \hat\alpha_1\pm t_{n-1,1-\frac{\alpha^*}{2}}SE(\hat\alpha_1),\quad\hat\alpha_2\pm t_{n-1,1-\frac{\alpha^*}{2}}SE(\hat\alpha_2), \] will be \(1-\alpha^*\) confidence intervals for \(\alpha_1,\alpha_2\) respectively. In our case we have \(n=5\) data points and are seeking \(95\%\) confidence intervals, such that \(\alpha^*=0.05.\) Note that this recipe uses \(n-1=4\) degrees of freedom as in the one-sample setting, whereas the \(F\)-based method above effectively uses \(N-k=3,\) which explains the slight difference between the two sets of intervals.
talphastar <- qt(0.975,4) #2.776445
KI_alpha1_2 <- c(alphahat[1]-talphastar*sqrt(sigmatilde2*solve(t(A)%*%A)[1,1]), alphahat[1]+talphastar*sqrt(sigmatilde2*solve(t(A)%*%A)[1,1]))
## [1] -0.5549949 1.3249949
KI_alpha2_2 <- c(alphahat[2]-talphastar*sqrt(sigmatilde2*solve(t(A)%*%A)[2,2]), alphahat[2]+talphastar*sqrt(sigmatilde2*solve(t(A)%*%A)[2,2]))
## [1] 0.4345538 2.5364462
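A short note on why the two constructions would otherwise agree: with matching degrees of freedom \(d\), the \(F\)-based and \(t\)-based intervals coincide, since the square root of an \(F_{1,d}\) quantile equals the corresponding \(t_d\) quantile:

```r
# The F(1, d) quantile is the square of the t(d) quantile, so the two
# interval constructions agree whenever the same df d is used.
d <- 3
stopifnot(all.equal(sqrt(qf(0.95, df1 = 1, df2 = d)), qt(0.975, df = d)))
```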
We can import the data with
EHopg11dot3 <- read_table2("EHopg11dot3.txt",
col_types = cols(Alder = col_integer()))
## Warning: 19 parsing failures.
## row col expected actual file
## 2 -- 3 columns 4 columns 'EHopg11dot3.txt'
## 3 -- 3 columns 4 columns 'EHopg11dot3.txt'
## 4 -- 3 columns 4 columns 'EHopg11dot3.txt'
## 5 -- 3 columns 4 columns 'EHopg11dot3.txt'
## 6 -- 3 columns 4 columns 'EHopg11dot3.txt'
## ... ... ......... ......... .................
## See problems(...) for more details.
We may subset the data according to profession with
jour <- subset(EHopg11dot3, EHopg11dot3$Fag == "J")
uni <- subset(EHopg11dot3, EHopg11dot3$Fag == "U")
We may then plot blood pressure against age in each of the two groups;
plot(jour$Alder, jour$Blodtryk, ylab = "Bloodpressure of journalists (mm Hg)", xlab = "Years of age")
plot(uni$Alder, uni$Blodtryk, ylab = "Bloodpressure of University teachers (mm Hg)", xlab = "Years of age")
Looking at the plots, and given the low number of data points, it does not look unreasonable to assume a linear relation between blood pressure and age.
We assume that \(Y_1,\ldots,Y_n\) are independent with \(Y_i\sim\mathcal{N}(\alpha+\beta x_i,\sigma^2),\) where \(\alpha,\beta\in\mathbb{R},\,\sigma^2>0\) are unknown parameters,
for each of the two groupings. Implementing this in R and peering at the summary is done through
jourmodel <- lm(jour$Blodtryk ~ jour$Alder)
summary(jourmodel)
##
## Call:
## lm(formula = jour$Blodtryk ~ jour$Alder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.644 -2.881 1.274 1.849 7.219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.9952 7.5966 11.19 2.38e-07 ***
## jour$Alder 1.5296 0.1325 11.54 1.74e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.26 on 11 degrees of freedom
## Multiple R-squared: 0.9237, Adjusted R-squared: 0.9168
## F-statistic: 133.2 on 1 and 11 DF, p-value: 1.737e-07
just as well as we might draw estimates for \(\alpha_{j},\,\beta_j\) out of R with
alphaj <- jourmodel$coefficients[[1]]
## [1] 84.99522
betaj <- jourmodel$coefficients[[2]]
## [1] 1.529568
We copy the procedure of subproblem b;
unimodel <- lm(uni$Blodtryk ~ uni$Alder)
summary(unimodel)
##
## Call:
## lm(formula = uni$Blodtryk ~ uni$Alder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.248 -1.673 1.127 1.602 6.814
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.6299 10.2884 7.351 5.58e-06 ***
## uni$Alder 1.5624 0.1817 8.597 1.01e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.41 on 13 degrees of freedom
## Multiple R-squared: 0.8504, Adjusted R-squared: 0.8389
## F-statistic: 73.91 on 1 and 13 DF, p-value: 1.008e-06
just as well as we might draw estimates for \(\alpha_{u},\,\beta_u\) out of R with
alphau <- unimodel$coefficients[[1]]
## [1] 75.62986
betau <- unimodel$coefficients[[2]]
## [1] 1.562384
We may implement an aggregate linear regression model for blood pressure across the two groupings, with each group responsible for its own set of parameters. Note however that the naive model <- lm(EHopg11dot3$Blodtryk ~ jour$Alder+uni$Alder) won’t work, as we would be asking to model the \(28\) datapoints of the dataset based on vectors of \(13\) and \(15\) datapoints respectively. Instead, note that what we desire is of the format \[
EX_i=1_J(i)\mathopen{}\left({\alpha_J+\beta_Jt_i}\right)\mathclose{}+1_U(i)\mathopen{}\left({\alpha_U+\beta_Ut_i}\right)\mathclose{}
\] As such we may then implement the indicator functions \(1_J,1_U\) directly in the design matrix “via the data,” which we will do by creating the matrix
A <-matrix(c(rep(1,13),rep(0,15), jour$Alder,rep(0,15),rep(0,13),rep(1,15),rep(0,13),uni$Alder),nrow=28)
A
## [,1] [,2] [,3] [,4]
## [1,] 1 68 0 0
## [2,] 1 62 0 0
## [3,] 1 49 0 0
## [4,] 1 62 0 0
## [5,] 1 34 0 0
## [6,] 1 66 0 0
## [7,] 1 45 0 0
## [8,] 1 60 0 0
## [9,] 1 57 0 0
## [10,] 1 57 0 0
## [11,] 1 63 0 0
## [12,] 1 56 0 0
## [13,] 1 57 0 0
## [14,] 0 0 1 55
## [15,] 0 0 1 49
## [16,] 0 0 1 60
## [17,] 0 0 1 54
## [18,] 0 0 1 58
## [19,] 0 0 1 51
## [20,] 0 0 1 43
## [21,] 0 0 1 52
## [22,] 0 0 1 53
## [23,] 0 0 1 61
## [24,] 0 0 1 61
## [25,] 0 0 1 57
## [26,] 0 0 1 57
## [27,] 0 0 1 70
## [28,] 0 0 1 63
We will thus be able to do;
modelgroups <- lm(EHopg11dot3$Blodtryk ~ A-1)
summary(modelgroups)
##
## Call:
## lm(formula = EHopg11dot3$Blodtryk ~ A - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.644 -3.019 1.201 1.785 7.219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## A1 84.9952 7.7426 10.978 7.71e-11 ***
## A2 1.5296 0.1351 11.322 4.12e-11 ***
## A3 75.6299 10.1296 7.466 1.05e-07 ***
## A4 1.5624 0.1789 8.732 6.47e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.342 on 24 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9993
## F-statistic: 1.045e+04 on 4 and 24 DF, p-value: < 2.2e-16
We may recognize the already calculated estimates in the summary.
The estimates might be found from the model in R via
modelgroups$coefficients[[1]]
modelgroups$coefficients[[2]]
modelgroups$coefficients[[3]]
modelgroups$coefficients[[4]]
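As an aside (a sketch on made-up toy data, not the blood pressure data), the same per-group intercept-and-slope structure can be specified without building the design matrix by hand, using the formula interface with a grouping factor:

```r
# Toy sketch: y ~ g + g:x - 1 gives each group its own intercept and slope,
# the same structure as the hand-built block matrix A above.
g <- factor(rep(c("J", "U"), each = 4))
x <- c(31, 45, 52, 60, 35, 44, 58, 63)              # made-up ages
set.seed(1)
y <- ifelse(g == "J", 85 + 1.5 * x, 76 + 1.6 * x) + rnorm(8)
fit <- lm(y ~ g + g:x - 1)
coef(fit)  # gJ, gU (intercepts) and gJ:x, gU:x (slopes)
```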
We may then finally do a full linear regression model based on the entire dataset, with no splitting into groups;
model <- lm(EHopg11dot3$Blodtryk~EHopg11dot3$Alder)
The estimates may once again be collected via
alpha <- model$coefficients[[1]]
alpha
## [1] 79.6603
beta <- model$coefficients[[2]]
beta
## [1] 1.552729
Read ‘An Introduction to General Linear Models’, chapter 6.4.
As there is no data to read in, we may write it ourselves;
(data <- tibble(
koag = c(62,60,63,59, 63,67,71,64,65,66, 68,66,71,67,68,68, 56,62,60,61,63,64,63,59),
Diaet = factor(c(rep("A",4),rep("B", 6), rep("C",6), rep("D",8)), levels = c("A","B", "C", "D"))))
## # A tibble: 24 x 2
## koag Diaet
## <dbl> <fct>
## 1 62 A
## 2 60 A
## 3 63 A
## 4 59 A
## 5 63 B
## 6 67 B
## 7 71 B
## 8 64 B
## 9 65 B
## 10 66 B
## # ... with 14 more rows
The linear model for a one-factor trial of the data treats the Diaet factor as a map \(F:\mathopen{}\left\{{1,\ldots,24}\right\}\mathclose{}\rightarrow\mathopen{}\left\{{A,B,C,D}\right\}\mathclose{}\) assigning each observation to its diet group.
We may start off with a visual assessment of the variance and normality assumptions via a standardized residual plot and a QQ-plot, as was done in HS16.
Stdresplot <- function(model, main = paste("(Estimate, Std. Res.)-plot of", deparse(substitute(model))), ylab ="Standardized residuals", ...) {
fit <- fitted(model)
rst <- rstandard(model)
qplot(fit, rst, main = main, ylab = ylab, ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) )+geom_hline(yintercept = 0) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
}
QQplotdraw <- function(model, main = paste("Normal QQ-plot of", deparse(substitute(model))), xlab = "Theoretical Quantiles", ylab ="Sample Quantiles", ...) {
rst <- rstandard(model)
ggplot(mapping = aes(sample = rst)) + geom_qq() + geom_qq_line() +
labs(title = main, x = xlab, y = ylab) # ggplot() ignores main/xlab/ylab arguments, so the labels are set via labs()
}
StdresQQPlot <- function(model,...) {
p1 <- Stdresplot(model,...)
p2 <- QQplotdraw(model,...)
library(gridExtra)
grid.arrange(p1,p2, ncol = 2)
}
lmD <- lm(koag ~ Diaet - 1, data = data)
summary(lmD)
##
## Call:
## lm(formula = koag ~ Diaet - 1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.00 -1.25 0.00 1.25 5.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## DiaetA 61.0000 1.1832 51.55 <2e-16 ***
## DiaetB 66.0000 0.9661 68.32 <2e-16 ***
## DiaetC 68.0000 0.9661 70.39 <2e-16 ***
## DiaetD 61.0000 0.8367 72.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.366 on 20 degrees of freedom
## Multiple R-squared: 0.9989, Adjusted R-squared: 0.9986
## F-statistic: 4399 on 4 and 20 DF, p-value: < 2.2e-16
StdresQQPlot(lmD)
The std. res. plot gives no indication that the variance assumptions are off, given the limited amount of data. See the likes of HS2 for an in-depth description of what to look for. Looking at the QQ-plot, we see very little to suggest that the normality assumption is off.
Having established the framework, we can state that the estimated means of the groups \(A,B,C,D\) are \(61,66,68,61.\) Note also that by (12.7), the estimate for the mean of each group is, under the model, normally distributed around the true group mean, with variance equal to the common variance \(\sigma^2\) divided by the number of observations in that particular group. As such the estimated variances can be calculated by
(sig2 <- (summary(lmD)$sigma)^2)
## [1] 5.6
sig2/(count(data, Diaet)[[2]])
## [1] 1.4000000 0.9333333 0.9333333 0.7000000
We see that we have pulled out the central variance estimate of the model in subproblem two. By EH page 501, the central estimate of the variance will, according to the model, be \(\chi^2\) distributed with \(|I|-|F| \overset{\cdot}{=}24-4=20\) degrees of freedom and scale parameter \(\frac{\sigma^2}{|I|-|F|},\) once again with \(\sigma^2\) being the true variance.
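Since the full data set is typed in above, the central variance estimate can also be verified by hand as RSS divided by \(|I|-|F|=20\); a self-contained check:

```r
# Recomputing sigma-tilde^2 = RSS / (|I| - |F|) = 112 / 20 = 5.6 from scratch.
koag  <- c(62,60,63,59, 63,67,71,64,65,66, 68,66,71,67,68,68,
           56,62,60,61,63,64,63,59)
Diaet <- factor(c(rep("A",4), rep("B",6), rep("C",6), rep("D",8)))
rss <- sum(residuals(lm(koag ~ Diaet))^2)
stopifnot(all.equal(rss / (24 - 4), 5.6))
```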
We may test whether there is any difference between the groups, i.e. test whether the mean value vector lies in \(L_1.\)
lm_L1 <- lm(koag ~ 1, data)
(an <- anova(lm_L1, lmD))
## Analysis of Variance Table
##
## Model 1: koag ~ 1
## Model 2: koag ~ Diaet - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 340
## 2 20 112 3 228 13.571 4.658e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From which we in particular see that we get an observed \(F\) value of \(13.571\), and therefore a \(p\)-value of \(\approx 4.66\cdot 10^{-5},\) which we consider low enough to warrant rejecting the hypothesis of the groups being identically distributed.
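The \(F\) statistic in the table can be recomputed directly from the two residual sums of squares; a small check:

```r
# F = ((RSS_small - RSS_big) / df_diff) / (RSS_big / df_big),
# with the RSS values 340 and 112 taken from the anova table above.
Fval <- ((340 - 112) / 3) / (112 / 20)
Fval  # approx 13.571
```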
We may use the command
confint(lmD)
## 2.5 % 97.5 %
## DiaetA 58.53185 63.46815
## DiaetB 63.98477 68.01523
## DiaetC 65.98477 70.01523
## DiaetD 59.25476 62.74524
to see that a \(95\%\) confidence interval for blood coagulation for patients in diet group A is \(\mathopen{}\left({58.53185, 63.46815}\right)\mathclose{}.\)
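A hedged hand computation of the Diaet A row, using the estimate and standard error from the summary of lmD and the residual degrees of freedom \(20\):

```r
# By-hand version of the DiaetA row of confint(lmD): estimate +/- t * se.
est <- 61
se  <- 1.1832                 # std. error of DiaetA from summary(lmD)
tq  <- qt(0.975, df = 20)     # residual df of the full model
c(lower = est - tq * se, upper = est + tq * se)  # approx (58.53, 63.47)
```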
We might create a new data set with only the \(A\)-data, and do an analysis on this;
diA <- subset(data, Diaet == "A") #remember the double '=='
lm_A <- lm(koag~1, data = diA)
summary(lm_A)
##
## Call:
## lm(formula = koag ~ 1, data = diA)
##
## Residuals:
## 1 2 3 4
## 1 -1 2 -2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.0000 0.9129 66.82 7.39e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.826 on 3 degrees of freedom
confint(lm_A)
##                2.5 %   97.5 %
## (Intercept) 58.09484 63.90516
For the two confidence intervals for diet A we thus have \(\mathopen{}\left({58.53185, 63.46815}\right)\mathclose{}\) and \(\mathopen{}\left({58.09484,63.90516}\right)\mathclose{};\) we see that they are fairly similar.
We may do the same with diaet D:
diD <- subset(data, Diaet == "D")
lm_D <- lm(koag~1, data = diD)
confint(lmD)
## 2.5 % 97.5 %
## DiaetA 58.53185 63.46815
## DiaetB 63.98477 68.01523
## DiaetC 65.98477 70.01523
## DiaetD 59.25476 62.74524
confint(lm_D)
## 2.5 % 97.5 %
## (Intercept) 58.81078 63.18922
We may create the data as was done in EH problem 12.3:
(EH131 <- tibble(
vaegt = c(196,269,248,130,149,195,148,147,159,470,172,292,214,349,253,338,345,266,360,224,312,409,479,292,443,386,381,388,349,458,470,306),
vand = factor(c(rep("A",4),rep("B", 4), rep("C",4), rep("D",4)) %>% rep(.,2) %>% c(.), levels = c("A","B","C","D")),
goed = factor(c(rep("I", 16), rep("II", 16)))))
## # A tibble: 32 x 3
## vaegt vand goed
## <dbl> <fct> <fct>
## 1 196 A I
## 2 269 A I
## 3 248 A I
## 4 130 A I
## 5 149 B I
## 6 195 B I
## 7 148 B I
## 8 147 B I
## 9 159 C I
## 10 470 C I
## # ... with 22 more rows
We might do a one-way analysis of variance for the product factor \(G \times V.\) Note the interpretation of the one-way analysis of variance with a product factor, as presented in EH pages 521-522: every combination of the involved labels has its own mean, so we might ask whether all treatments have the same distribution (the same mean, as we assume the data \(X_i\sim\mathcal{N}\mathopen{}\left({{\xi_i},{\sigma^2}}\right)\mathclose{}\)).
We may thus start by performing model control of the variance-homogeneity and normality assumptions of the model, to establish the basis for the ANOVA framework. We turn to the standard visual model-validation methods, the standardized-residual plot and the QQ plot, as was also done in HS 16 and EH 12.3.
Note in particular that we implement the lm model of the product factor as a product in the following sense. Whether or not we include -1 is purely a matter of parametrization: with -1 each level of vand gets its own parameter (no intercept), while without it the levels are expressed as contrasts against a reference level; the fitted subspace is the same.
lmEH131 <- lm(vaegt ~ vand * goed -1, data = EH131)
model.matrix(lmEH131)
## vandA vandB vandC vandD goedII vandB:goedII vandC:goedII vandD:goedII
## 1 1 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0 0
## 5 0 1 0 0 0 0 0 0
## 6 0 1 0 0 0 0 0 0
## 7 0 1 0 0 0 0 0 0
## 8 0 1 0 0 0 0 0 0
## 9 0 0 1 0 0 0 0 0
## 10 0 0 1 0 0 0 0 0
## 11 0 0 1 0 0 0 0 0
## 12 0 0 1 0 0 0 0 0
## 13 0 0 0 1 0 0 0 0
## 14 0 0 0 1 0 0 0 0
## 15 0 0 0 1 0 0 0 0
## 16 0 0 0 1 0 0 0 0
## 17 1 0 0 0 1 0 0 0
## 18 1 0 0 0 1 0 0 0
## 19 1 0 0 0 1 0 0 0
## 20 1 0 0 0 1 0 0 0
## 21 0 1 0 0 1 1 0 0
## 22 0 1 0 0 1 1 0 0
## 23 0 1 0 0 1 1 0 0
## 24 0 1 0 0 1 1 0 0
## 25 0 0 1 0 1 0 1 0
## 26 0 0 1 0 1 0 1 0
## 27 0 0 1 0 1 0 1 0
## 28 0 0 1 0 1 0 1 0
## 29 0 0 0 1 1 0 0 1
## 30 0 0 0 1 1 0 0 1
## 31 0 0 0 1 1 0 0 1
## 32 0 0 0 1 1 0 0 1
## attr(,"assign")
## [1] 1 1 1 1 2 3 3 3
## attr(,"contrasts")
## attr(,"contrasts")$vand
## [1] "contr.treatment"
##
## attr(,"contrasts")$goed
## [1] "contr.treatment"
lmEH131_1 <- lm(vaegt ~ 1, data = EH131)
anova(lmEH131_1, lmEH131)
## Analysis of Variance Table
##
## Model 1: vaegt ~ 1
## Model 2: vaegt ~ vand * goed - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31 357537
## 2 24 146008 7 211529 4.9671 0.001392 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lmEH131)
##
## Call:
## lm(formula = vaegt ~ vand * goed - 1, data = EH131)
##
## Residuals:
## Min 1Q Median 3Q Max
## -114.25 -50.31 -11.62 47.06 196.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## vandA 210.75 39.00 5.404 1.50e-05 ***
## vandB 159.75 39.00 4.096 0.000413 ***
## vandC 273.25 39.00 7.007 3.03e-07 ***
## vandD 288.50 39.00 7.398 1.23e-07 ***
## goedII 88.00 55.15 1.596 0.123671
## vandB:goedII 125.25 78.00 1.606 0.121394
## vandC:goedII 38.25 78.00 0.490 0.628306
## vandD:goedII 19.25 78.00 0.247 0.807160
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78 on 24 degrees of freedom
## Multiple R-squared: 0.9549, Adjusted R-squared: 0.9398
## F-statistic: 63.48 on 8 and 24 DF, p-value: 2.854e-14
summary(lm(vaegt ~ vand * goed, data = EH131))
##
## Call:
## lm(formula = vaegt ~ vand * goed, data = EH131)
##
## Residuals:
## Min 1Q Median 3Q Max
## -114.25 -50.31 -11.62 47.06 196.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.75 39.00 5.404 1.5e-05 ***
## vandB -51.00 55.15 -0.925 0.364
## vandC 62.50 55.15 1.133 0.268
## vandD 77.75 55.15 1.410 0.171
## goedII 88.00 55.15 1.596 0.124
## vandB:goedII 125.25 78.00 1.606 0.121
## vandC:goedII 38.25 78.00 0.490 0.628
## vandD:goedII 19.25 78.00 0.247 0.807
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78 on 24 degrees of freedom
## Multiple R-squared: 0.5916, Adjusted R-squared: 0.4725
## F-statistic: 4.967 on 7 and 24 DF, p-value: 0.001392
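The F statistic and p-value in the anova table above can be recomputed directly from the two residual sums of squares. A quick numeric cross-check in Python with scipy (outside the document's R workflow):

```python
from scipy import stats

# Residual sums of squares and degrees of freedom from the anova table above
rss_null, df_null = 357537, 31   # vaegt ~ 1
rss_full, df_full = 146008, 24   # vaegt ~ vand * goed - 1
# F = (drop in RSS per dropped dimension) / (residual mean square)
F = ((rss_null - rss_full) / (df_null - df_full)) / (rss_full / df_full)
p = stats.f.sf(F, df_null - df_full, df_full)
print(F, p)  # roughly 4.967 and 0.001392
```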
Note the difference between the two parametrizations: with -1 the vand coefficients are estimated cell means for goed level I and the remaining coefficients are offsets, and the reported R² (0.9549) and F-statistic (63.48 on 8 and 24 DF) are computed against the zero model, which inflates them. Without -1 the coefficients are contrasts against the reference cell (vandA, goedI), and R² (0.5916) and the F-statistic (4.967 on 7 and 24 DF, identical to the anova against the intercept-only model above) are computed against the intercept-only model. The two fits have identical fitted values and residuals.
The linear model based on the product factor can be described by \(X_{ijk} = \xi_{ij} + \varepsilon_{ijk},\) with \(\varepsilon_{ijk}\sim\mathcal{N}\mathopen{}\left({0,\sigma^2}\right)\mathclose{}\) independent, i.e. one mean parameter for each combination of the two factors.
We may load in the data;
opgave13_4 <- read_table2("opgave13_4.txt",
col_types = cols(Blander = col_factor(levels = c("1",
"2", "3")), Bryder = col_factor(levels = c("1",
"2", "3"))))
Note in this regard that one could mutate the data set so that the labels of Blander match those used in the problem statement.
We may implement the model in R with
model134 <- lm(Respons ~ Blander*Bryder, data = opgave13_4)
summary(model134)
##
## Call:
## lm(formula = Respons ~ Blander * Bryder, data = opgave13_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -655.0 -347.5 0.0 290.0 1210.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5340.0 262.0 20.379 <2e-16 ***
## Blander2 -295.0 370.6 -0.796 0.433
## Blander3 335.0 370.6 0.904 0.374
## Bryder2 -350.0 370.6 -0.944 0.353
## Bryder3 -525.0 370.6 -1.417 0.168
## Blander2:Bryder2 650.0 524.1 1.240 0.226
## Blander3:Bryder2 90.0 524.1 0.172 0.865
## Blander2:Bryder3 -5.0 524.1 -0.010 0.992
## Blander3:Bryder3 -232.5 524.1 -0.444 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 524.1 on 27 degrees of freedom
## Multiple R-squared: 0.3542, Adjusted R-squared: 0.1628
## F-statistic: 1.851 on 8 and 27 DF, p-value: 0.1108
We may do the standard visual check of the model assumptions regarding variance, and normality:
Stdresplot <- function(model, main = paste("(Estimate, Std. Res.)-plot of", deparse(substitute(model))), ylab ="Standardized residuals", ...) {
fit <- fitted(model)
rst <- rstandard(model)
qplot(fit, rst, main = main, ylab = ylab, ylim = c(-max(3.2,max(abs(rst))), max(3.2,max(abs(rst)))) )+geom_hline(yintercept = 0) #Largest symmetric interval (around 0) of (-3.2,3.2) or (-largest absolute rst, largest absolute rst)
}
QQplotdraw <- function(model, main = paste("Normal QQ-plot of", deparse(substitute(model))), xlab = "Theoretical Quantiles", ylab ="Sample Quantiles", ...) {
rst <- rstandard(model)
#dataname <- getCall(lm_LT)$data
ggplot(data = eval(getCall(model)$data)) + geom_qq() + geom_qq_line() + aes(sample = rst) + labs(title = main, x = xlab, y = ylab)
} #main, xlab and ylab must be passed through labs(); ggplot() itself ignores them as direct arguments
StdresQQPlot <- function(model,...) {
p1 <- Stdresplot(model,...)
p2 <- QQplotdraw(model,...)
library(gridExtra)
grid.arrange(p1,p2, ncol = 2)
}
StdresQQPlot(model134)
which for the most part (a couple of lower valued points seem to miss the trend in the QQplot) look reasonable.
We may also get the central variance estimate from the model with
summary(model134)
##
## Call:
## lm(formula = Respons ~ Blander * Bryder, data = opgave13_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -655.0 -347.5 0.0 290.0 1210.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5340.0 262.0 20.379 <2e-16 ***
## Blander2 -295.0 370.6 -0.796 0.433
## Blander3 335.0 370.6 0.904 0.374
## Bryder2 -350.0 370.6 -0.944 0.353
## Bryder3 -525.0 370.6 -1.417 0.168
## Blander2:Bryder2 650.0 524.1 1.240 0.226
## Blander3:Bryder2 90.0 524.1 0.172 0.865
## Blander2:Bryder3 -5.0 524.1 -0.010 0.992
## Blander3:Bryder3 -232.5 524.1 -0.444 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 524.1 on 27 degrees of freedom
## Multiple R-squared: 0.3542, Adjusted R-squared: 0.1628
## F-statistic: 1.851 on 8 and 27 DF, p-value: 0.1108
sigma(model134)^2
## [1] 274647.2
In the case of the one-way analysis of the product factor we might, as in EH 12.4, note that by EH p. 501 the central estimate of the variance is scaled-\(\chi^2\) distributed, with \(\mathopen{}\left|{I}\right|\mathclose{}-\mathopen{}\left|{Br\times Bl}\right|\mathclose{}\overset{\cdot}{=}36-3\cdot 3 = 27\) degrees of freedom and scale parameter \(\frac{\sigma^2}{\mathopen{}\left|{I}\right|\mathclose{}-\mathopen{}\left|{Br\times Bl}\right|\mathclose{}}\overset{\cdot}{=}\frac{\sigma^2}{27},\) such that \(\frac{27}{\sigma^2}\tilde{\sigma}^2\sim\chi^2_{\mathopen{}\left|{I}\right|\mathclose{}-\mathopen{}\left|{Br\times Bl}\right|\mathclose{}}.\)
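This scaled-\(\chi^2\) distribution lets us, for instance, turn the central variance estimate into a confidence interval for \(\sigma^2\) by inverting the \(\chi^2_{27}\) quantiles. A hedged Python/scipy sketch (the value 274647.2 is sigma(model134)^2 from above; the R workflow itself does not compute this interval):

```python
from scipy import stats

df = 27
s2 = 274647.2  # central variance estimate from the R model above
# (27 / sigma^2) * s2 ~ chi^2_27, so invert the quantiles for a 95% interval
lower = df * s2 / stats.chi2.ppf(0.975, df)
upper = df * s2 / stats.chi2.ppf(0.025, df)
print(lower, upper)
```

Note the asymmetry of the interval around the point estimate, which is characteristic of variance intervals.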
The canonical statistic for a test of the additive model against the full model is the likelihood ratio test statistic, which may be found on page 111 of BMS.
Making the additive model in R:
add <- lm(Respons ~ 0+Blander + Bryder, data = opgave13_4)
summary(add)
##
## Call:
## lm(formula = Respons ~ 0 + Blander + Bryder, data = opgave13_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -840.83 -291.67 -20.83 242.29 1099.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Blander1 5284.2 190.3 27.774 < 2e-16 ***
## Blander2 5204.2 190.3 27.353 < 2e-16 ***
## Blander3 5571.7 190.3 29.285 < 2e-16 ***
## Bryder2 -103.3 208.4 -0.496 0.62353
## Bryder3 -604.2 208.4 -2.899 0.00682 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 510.5 on 31 degrees of freedom
## Multiple R-squared: 0.9915, Adjusted R-squared: 0.9902
## F-statistic: 726.1 on 5 and 31 DF, p-value: < 2.2e-16
model.matrix(add)
## Blander1 Blander2 Blander3 Bryder2 Bryder3
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 1 0
## 6 1 0 0 1 0
## 7 1 0 0 1 0
## 8 1 0 0 1 0
## 9 1 0 0 0 1
## 10 1 0 0 0 1
## 11 1 0 0 0 1
## 12 1 0 0 0 1
## 13 0 1 0 0 0
## 14 0 1 0 0 0
## 15 0 1 0 0 0
## 16 0 1 0 0 0
## 17 0 1 0 1 0
## 18 0 1 0 1 0
## 19 0 1 0 1 0
## 20 0 1 0 1 0
## 21 0 1 0 0 1
## 22 0 1 0 0 1
## 23 0 1 0 0 1
## 24 0 1 0 0 1
## 25 0 0 1 0 0
## 26 0 0 1 0 0
## 27 0 0 1 0 0
## 28 0 0 1 0 0
## 29 0 0 1 1 0
## 30 0 0 1 1 0
## 31 0 0 1 1 0
## 32 0 0 1 1 0
## 33 0 0 1 0 1
## 34 0 0 1 0 1
## 35 0 0 1 0 1
## 36 0 0 1 0 1
## attr(,"assign")
## [1] 1 1 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$Blander
## [1] "contr.treatment"
##
## attr(,"contrasts")$Bryder
## [1] "contr.treatment"
anova(add, model134)
## Analysis of Variance Table
##
## Model 1: Respons ~ 0 + Blander + Bryder
## Model 2: Respons ~ Blander * Bryder
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31 8079308
## 2 27 7415475 4 663833 0.6043 0.6629
Where does Bryder1 go? Since the model was fitted with 0 +, the Blander columns take over the role of the intercept, and Bryder1 becomes the reference level absorbed into the Blander coefficients. Note that a \(p\)-value of \(0.66\) is sufficiently high that we cannot reject the hypothesis that there is no interaction between the effects.
We may visualise this with
ggplot(opgave13_4, aes(x = Blander, y = Respons, color = Bryder, group = Bryder)) +
geom_point() +
geom_line(stat = "summary", fun = "mean", size = 1) +
xlab("Blander") +
ylab("Knæktryk") +
scale_color_discrete("Bryder")
This does not look great for the green line; how should we proceed with the hypothesis of additivity?
If we were to continue with the hypothesis of additivity, we would be able to analyse whether the Blander and Bryder factors have an effect by themselves. Doing this in R:
lmBl <- lm(Respons ~ Blander-1, opgave13_4)
summary(lmBl)
##
## Call:
## lm(formula = Respons ~ Blander - 1, data = opgave13_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -888.33 -458.33 7.92 319.79 1231.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Blander1 5048.3 163.5 30.88 <2e-16 ***
## Blander2 4968.3 163.5 30.39 <2e-16 ***
## Blander3 5335.8 163.5 32.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 566.4 on 33 degrees of freedom
## Multiple R-squared: 0.9889, Adjusted R-squared: 0.9879
## F-statistic: 980.7 on 3 and 33 DF, p-value: < 2.2e-16
lmBr <- lm(Respons ~ Bryder-1, opgave13_4)
summary(lmBr)
##
## Call:
## lm(formula = Respons ~ Bryder - 1, data = opgave13_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -933.33 -310.00 -31.25 337.50 950.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Bryder1 5353.3 150.6 35.56 <2e-16 ***
## Bryder2 5250.0 150.6 34.87 <2e-16 ***
## Bryder3 4749.2 150.6 31.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 521.5 on 33 degrees of freedom
## Multiple R-squared: 0.9906, Adjusted R-squared: 0.9897
## F-statistic: 1158 on 3 and 33 DF, p-value: < 2.2e-16
as such we might do individual ANOVAs between these models and the full model.
anova(lmBl, model134)
## Analysis of Variance Table
##
## Model 1: Respons ~ Blander - 1
## Model 2: Respons ~ Blander * Bryder
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 33 10585425
## 2 27 7415475 6 3169950 1.9236 0.1132
anova(lmBr, model134)
## Analysis of Variance Table
##
## Model 1: Respons ~ Bryder - 1
## Model 2: Respons ~ Blander * Bryder
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 33 8975758
## 2 27 7415475 6 1560283 0.9468 0.4787
We see that the test against the Blander-only model gives a lower \(p\)-value (0.1132) than the test against the Bryder-only model (0.4787). Neither is below 0.05, however, so at the 5% level we cannot reject either reduction, though the data speak more strongly against Blander alone describing the attained strength of the tiles.
anova(model134)
## Analysis of Variance Table
##
## Response: Respons
## Df Sum Sq Mean Sq F value Pr(>F)
## Blander 2 896450 448225 1.6320 0.21424
## Bryder 2 2506117 1253058 4.5624 0.01963 *
## Blander:Bryder 4 663833 165958 0.6043 0.66290
## Residuals 27 7415475 274647
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpreting the above ANOVA: the sequential table adds the terms one at a time. Blander alone does not explain a significant amount of variation (p = 0.214), Bryder given Blander does (p = 0.0196), and the interaction given both main effects does not (p = 0.663), agreeing with the comparison of the additive and full models above. Regarding -1: since Respons ~ Blander - 1 and Respons ~ Blander span the same subspace, model comparisons via anova(model_small, model_big) are unaffected by -1; only the reported coefficients (group means versus contrasts) and the R² and F-statistic in summary change, since the latter are then computed against the zero model instead of the intercept-only model.
We may start off by defining various starting parameters;
n <- 10^3 #outcomes
mu <- 4
delta <- 3
reprep <- 10^4 #number of repeated experiments
dat <- matrix(NA,n,reprep)
Then we run the simulation of reprep \(=10^4\) data sets of \(n = 1000\) \(U(\mu-\delta,\mu+\delta)\) data points.
for (i in 1:reprep) {
dat[,i] <- runif(n, min=mu-delta, max = mu+delta)
}
datMax <- apply(dat, 2, max) #apply the max function to all columns (2) of dat
datMin <- apply(dat, 2, min)
#Definitions in problem b)
muhat <- (datMin+datMax)/2
deltahat <- (datMax-datMin)/2
Note also that
mean(muhat)
## [1] 3.999994
mean(deltahat)
## [1] 2.99399
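The small downward bias in mean(deltahat) is the classical \(E[\hat\delta]=\delta\frac{n-1}{n+1}\) for the half-range estimator; with \(n=1000,\ \delta=3\) this gives \(3\cdot 999/1001 \approx 2.994,\) matching the simulation. A compact Python cross-check of the same experiment (fewer repetitions than the R run, to keep it fast):

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, delta, reps = 1000, 4.0, 3.0, 2000
dat = rng.uniform(mu - delta, mu + delta, size=(reps, n))
muhat = (dat.min(axis=1) + dat.max(axis=1)) / 2     # midrange estimator
deltahat = (dat.max(axis=1) - dat.min(axis=1)) / 2  # half-range estimator
print(muhat.mean())     # close to mu = 4 (unbiased)
print(deltahat.mean())  # close to 3 * 999/1001 = 2.994 (slightly biased down)
```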
We may do histograms of the data
p_1 <- qplot(muhat, geom = "histogram")
p_2 <- qplot(deltahat, geom = "histogram")
grid.arrange(p_1,p_2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To verify c);
mucheck <- apply(dat, 2, median)
mean(mucheck)
## [1] 4.000596
qplot(mucheck)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note that the Poisson distribution, like the data, has support on \(\mathbb{N}_0,\) and since the data points appear to fall off in a roughly exponentially decaying fashion, this could very well be something Poisson.
Under the assumption that the data are Poisson distributed, note that the Poisson distribution is uniquely determined by its mean. Our best estimate of the mean of the data, assumed Poisson distributed, will thus be the average of the data. We have \(150\) experiments and get
library(pander)
## Warning: package 'pander' was built under R version 4.0.5
n <- 150
matrixx <- matrix(c(0:4,98,40,8,3,1), nrow=2, byrow=T)
matrixx[2,]
## [1] 98 40 8 3 1
data <- as.data.frame(matrixx)
rownames(data) <- c("Number of colonies per dish", "Number of Petri dishes")
data
## V1 V2 V3 V4 V5
## Number of colonies per dish 0 1 2 3 4
## Number of Petri dishes 98 40 8 3 1
pander(matrixx)
| 0 | 1 | 2 | 3 | 4 |
|----|----|---|---|---|
| 98 | 40 | 8 | 3 | 1 |
(meandata <- (0*98+1*40+2*8+3*3+4*1)/150)
## [1] 0.46
lambda <- meandata
We would like to be able to wrap our Poisson model in a multinomial family, to which we can then apply well-known results, whereby we can analyse whether the data can be said to come from the proposed Poisson model.
Note that we have five categories of data points, namely 0 colonies per Petri dish, 1 colony per Petri dish, …, 4 colonies per Petri dish, so we could obtain a multinomial model with \(k=4.\)
As mentioned on page 130, however, we would like at least five data points for each category.
We thus see that we must reduce to \(k=2\) with the three categories 0 colonies per Petri dish, 1 colony per Petri dish, and \(\geq 2\) colonies per Petri dish, so that we get the table:
matrixx_2 <- matrix(c(98,40,14), nrow=1, byrow=T)
colnames(matrixx_2) <- c("0","1",">=2")
rownames(matrixx_2) <- c("Number of Petri dishes")
pander(matrixx_2)
|  | 0 | 1 | >=2 |
|---|---|---|---|
| Number of Petri dishes | 98 | 40 | 14 |
For \(k=2\) we may, for example, choose to parametrize the model in terms of the probabilities, and we choose to parametrize by \((\pi_0,\pi_1)\in\mathopen{}\left({0,1}\right)\mathclose{}^2,\) where \(\pi_0:=P(X=0)\) and \(\pi_1:=P(X=1).\)
Under the Poisson model for \(X\) we may thus declare \(\pi_0:=P(X=0)=\frac{\lambda^0}{0!}e^{-\lambda},\,\pi_1:=P(X=1)=\frac{\lambda^1}{1!}e^{-\lambda}.\)
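To sketch where this setup is heading: under the fitted \(\lambda = 0.46\) the expected counts in the three categories are \(150\,\pi_0,\ 150\,\pi_1,\ 150(1-\pi_0-\pi_1),\) which can be compared with the observed counts 98, 40, 14 by a \(\chi^2\) statistic on \(2 - 1 = 1\) degree of freedom (one multinomial dimension is used up by estimating \(\lambda\)). A hedged numeric sketch in Python (the test itself would be carried out in R in this document's workflow):

```python
import math
from scipy import stats

lam = 0.46
n = 150
obs = [98, 40, 14]  # 0, 1, >=2 colonies
p0 = math.exp(-lam)            # P(X = 0) under the Poisson model
p1 = lam * math.exp(-lam)      # P(X = 1)
probs = [p0, p1, 1 - p0 - p1]  # collapse the tail into ">= 2"
exp_counts = [n * p for p in probs]
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp_counts))
p_value = stats.chi2.sf(chi2, df=1)  # df = 2 - 1, lambda estimated
print(chi2, p_value)
```

With a p-value of this size the Poisson hypothesis would not be rejected.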
We will be reading in data as
restaurant <- read_csv("restaurant.txt",
col_types = cols(D = col_factor(levels = c("m",
"ti", "o", "to", "f", "l", "s")),
B = col_factor(levels = c("br", "fr",
"mi")), T = col_factor(levels = c("fa",
"ta"))))
table(restaurant$B,restaurant$T, restaurant$D)
## , , = m
##
##
## fa ta
## br 0 0
## fr 4 4
## mi 0 0
##
## , , = ti
##
##
## fa ta
## br 0 0
## fr 4 4
## mi 0 0
##
## , , = o
##
##
## fa ta
## br 0 0
## fr 4 4
## mi 0 0
##
## , , = to
##
##
## fa ta
## br 0 0
## fr 4 4
## mi 0 0
##
## , , = f
##
##
## fa ta
## br 0 0
## fr 4 4
## mi 0 0
##
## , , = l
##
##
## fa ta
## br 4 4
## fr 0 0
## mi 4 4
##
## , , = s
##
##
## fa ta
## br 4 4
## fr 0 0
## mi 4 4
count(restaurant, B, T)
## # A tibble: 6 x 3
## B T n
## <fct> <fct> <int>
## 1 br fa 8
## 2 br ta 8
## 3 fr fa 20
## 4 fr ta 20
## 5 mi fa 8
## 6 mi ta 8
count(restaurant, B, D)
## # A tibble: 9 x 3
## B D n
## <fct> <fct> <int>
## 1 br l 8
## 2 br s 8
## 3 fr m 8
## 4 fr ti 8
## 5 fr o 8
## 6 fr to 8
## 7 fr f 8
## 8 mi l 8
## 9 mi s 8
count(restaurant, D, T,B)
## # A tibble: 18 x 4
## D T B n
## <fct> <fct> <fct> <int>
## 1 m fa fr 4
## 2 m ta fr 4
## 3 ti fa fr 4
## 4 ti ta fr 4
## 5 o fa fr 4
## 6 o ta fr 4
## 7 to fa fr 4
## 8 to ta fr 4
## 9 f fa fr 4
## 10 f ta fr 4
## 11 l fa br 4
## 12 l fa mi 4
## 13 l ta br 4
## 14 l ta mi 4
## 15 s fa br 4
## 16 s fa mi 4
## 17 s ta br 4
## 18 s ta mi 4
table((restaurant$B:restaurant$T), restaurant$D) #Table for (B x T) min D
##
## m ti o to f l s
## br:fa 0 0 0 0 0 4 4
## br:ta 0 0 0 0 0 4 4
## fr:fa 4 4 4 4 4 0 0
## fr:ta 4 4 4 4 4 0 0
## mi:fa 0 0 0 0 0 4 4
## mi:ta 0 0 0 0 0 4 4
table(interaction(restaurant$B, restaurant$T, restaurant$D))
##
## br.fa.m fr.fa.m mi.fa.m br.ta.m fr.ta.m mi.ta.m br.fa.ti fr.fa.ti
## 0 4 0 0 4 0 0 4
## mi.fa.ti br.ta.ti fr.ta.ti mi.ta.ti br.fa.o fr.fa.o mi.fa.o br.ta.o
## 0 0 4 0 0 4 0 0
## fr.ta.o mi.ta.o br.fa.to fr.fa.to mi.fa.to br.ta.to fr.ta.to mi.ta.to
## 4 0 0 4 0 0 4 0
## br.fa.f fr.fa.f mi.fa.f br.ta.f fr.ta.f mi.ta.f br.fa.l fr.fa.l
## 0 4 0 0 4 0 4 0
## mi.fa.l br.ta.l fr.ta.l mi.ta.l br.fa.s fr.fa.s mi.fa.s br.ta.s
## 4 4 0 4 4 0 4 4
## fr.ta.s mi.ta.s
## 0 4
tablesum <- function(x) {
newrow <- apply(x,2,sum) #columnsums
newx <- rbind(x,newrow)
newcol <- apply(newx,1,sum) #rowsums
newnewx <- cbind(newx,newcol)
colnames(newnewx)[ncol(newnewx)] <- c("RowSum")
rownames(newnewx)[nrow(newnewx)] <- c("ColSum")
newnewx
}
# balanceequation <- function(x,S=F) { #Fancy, can be used with tablesum function
# nuco <- ncol(x) - S
# nuro <- nrow(x) - S
# for (i in 1:nuro) {
# for (j in 1:nuco) {
# x[i,]
# }
# }
# }
SamhBalanceEquation <- function(x) {
#Cannot be used with tablesum function,
# !!! requires connected components (sammenhængende design). !!!
sx <- sum(x)
testy <- x
rs <- apply(x,1,sum)
cs <- apply(x,2,sum)
nuco <- ncol(x)
nuro <- nrow(x)
for (i in 1:nuro) {
for (j in 1:nuco) {
if (x[i,j]==rs[i]*cs[j]/sx) {
testy[i,j] <- TRUE
} else {
testy[i,j] <- FALSE
}
}
}
if (sum(testy) == nuco*nuro) {
TRUE
} else {
FALSE
}
}
attach(restaurant)
## The following object is masked _by_ .GlobalEnv:
##
## X
## The following object is masked from package:base:
##
## T
SamhBalanceEquation(table(restaurant$B, restaurant$T))
## [1] TRUE
SamhBalanceEquation(table(interaction(restaurant$B, restaurant$T), restaurant$D))
## [1] FALSE
#Expand to make it test the balance equation for non-connected ('ikke sammenhængende') designs.
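The same balance check is easy to state outside R: a design table \(n_{ij}\) satisfies the balance equation when \(n_{ij} = n_{i\cdot}\,n_{\cdot j}/n\) for every cell. A Python sketch applied to the two tables above (the counts are copied from the R output; the function name is our own):

```python
import numpy as np

def is_balanced(table):
    """Check n_ij == (row sum * column sum) / total for every cell."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return bool(np.allclose(table, expected))

# counts from count(restaurant, B, T): rows br, fr, mi; columns fa, ta
bt = [[8, 8], [20, 20], [8, 8]]
# the (B x T) by D table from above: rows br:fa .. mi:ta, columns m .. s
btd = [[0, 0, 0, 0, 0, 4, 4], [0, 0, 0, 0, 0, 4, 4], [4, 4, 4, 4, 4, 0, 0],
       [4, 4, 4, 4, 4, 0, 0], [0, 0, 0, 0, 0, 4, 4], [0, 0, 0, 0, 0, 4, 4]]
print(is_balanced(bt))   # True, as SamhBalanceEquation found
print(is_balanced(btd))  # False
```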
We may do the models
m0 <- lm(X~D+B*T, data=restaurant)
m1 <- lm(log(X)~D+B*T, data=restaurant)
StdresQQPlot(m0)
StdresQQPlot(m1)
Note that this lets us compare the model assumptions on the raw and on the log scale, indicating which response scale is the more appropriate.
Look at page 67, note that we have a simple hypothesis of \(\theta=4,\) such that \(d\) (the dimension of the hypothesis) \(=0,\) and \(m\) (the dimension of the parameter space in the general model) \(=1\)
t <- c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0)
x <- c(0.88,0.86,0.62,0.79,0.66,0.54,0.88,0.87,0.73,0.79)
n <- length(x)
(y <- -t*log(x))
## [1] 0.01278334 0.03016458 0.14341074 0.09428893 0.20775772 0.36971168
## [7] 0.08948336 0.11140965 0.28323967 0.23572233
(S <- sum(y))
## [1] 1.577972
(ymean <- S/n)
## [1] 0.1577972
theta <- 4
LRfunc <- function(theta) {
2*n*(-log(theta)+theta*ymean-log(ymean)-1)
}
LR <- LRfunc(theta=4)
pchisq(LR, df = 1)#So p-value;
## [1] 0.8234918
1-pchisq(LR, df = 1)
## [1] 0.1765082
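The computation can be cross-checked numerically; a Python sketch using the same data and the same \(LR = 2n\mathopen{}\left({-\log\theta + \theta\bar y - \log\bar y - 1}\right)\mathclose{}\) expression as LRfunc above:

```python
import math
from scipy import stats

t = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
x = [0.88, 0.86, 0.62, 0.79, 0.66, 0.54, 0.88, 0.87, 0.73, 0.79]
n = len(x)
ybar = sum(-ti * math.log(xi) for ti, xi in zip(t, x)) / n  # mean of y = -t log x
theta = 4
LR = 2 * n * (-math.log(theta) + theta * ybar - math.log(ybar) - 1)
p = stats.chi2.sf(LR, df=1)  # simple hypothesis: m - d = 1 - 0 = 1 df
print(LR, p)                 # p close to 0.1765
```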
So we cannot reject the hypothesis that \(\theta=4.\)
In problem 1.2 we computed that \(\hat\theta = -\frac{n}{\sum_{i=1}^{n}{t_i\log{x_i}}},\) so for the data we get a \(\hat\theta\) of size:
(hattheta <- -n/(sum(t*log(x))))
## [1] 6.337248
We can then find the \(95\%\) quantile of the \(\chi^2_{1}\) distribution, and find the appropriate confidence interval; see p. 98 BMS:
(g025 <- qgamma(0.025, n, 1))
## [1] 4.795389
(g975 <- qgamma(0.975, n, 1))
## [1] 17.0848
c(g025/S, g975/S)
## [1] 3.038957 10.827064
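For reference, the same exact interval in Python/scipy, using that \(\theta S\sim\Gamma(n,1)\) under the model (the construction from BMS p. 98 that the R lines above apply):

```python
from scipy import stats

S = 1.577972  # sum of y_i = -t_i log x_i from above
n = 10
# theta * S ~ Gamma(n, 1) under the model, so invert the gamma quantiles
ci = (stats.gamma.ppf(0.025, n) / S, stats.gamma.ppf(0.975, n) / S)
print(ci)  # roughly (3.039, 10.827)
```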
qchisq(0.95, 1)
## [1] 3.841459
#uniroot(LRfunc, c(mean(LRfunc(4))/10,mean(LRfunc(4)))) #Does not work.
seqT <- seq(0.001,15, by=0.01)
vals <- LRfunc(seqT)
plot(seqT,vals)
We use the commands:
theta <- 5
sim <- function(n) {
q <- c()
for (i in 1:5000) {
t <- (1:n)/n
x <- rbeta(n, shape1=5*t, shape2=1)
q[i] <- -n/(sum(t*log(x)))
}
q
}
dat10 <- sim(10)
dat25 <- sim(25)
dat250 <- sim(250)
(mdat10 <- mean(dat10))
## [1] 5.602083
(mdat25 <- mean(dat25))
## [1] 5.216197
(mdat250 <- mean(dat250))
## [1] 5.017503
(sddat10 <- sd(dat10))
## [1] 2.01727
(sddat25 <- sd(dat25))
## [1] 1.079147
(sddat250 <- sd(dat250))
## [1] 0.3210583
seqT <- seq(0.001,15, by=0.01)
nsim <- function(n) dnorm(seqT, theta, sd=sqrt(theta^2/n))
par(mfrow=c(1,3))
hist(dat10, prob=T); lines(x=seqT, y=nsim(10))
hist(dat25, prob=T); lines(x=seqT, y=nsim(25))
hist(dat250, prob=T); lines(x=seqT, y=nsim(250))
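The agreement with the asymptotic \(\mathcal{N}\mathopen{}\left({\theta,\theta^2/n}\right)\mathclose{}\) approximation can be reproduced outside R as well; a small Python sketch of the same simulation (Beta(\(5t_i\),1) draws as in the rbeta call above, fewer repetitions for speed):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 5

def sim(n, reps=3000):
    t = np.arange(1, n + 1) / n
    # X_i ~ Beta(theta * t_i, 1); MLE: thetahat = -n / sum(t_i * log x_i)
    x = rng.beta(theta * t, 1, size=(reps, n))
    return -n / (t * np.log(x)).sum(axis=1)

est = sim(250)
print(est.mean(), est.std())  # mean near 5, sd near 5/sqrt(250) ~ 0.316
```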
moral <- read_table2("moral.txt",
col_types = cols(koen = col_factor(levels = c("mand",
"kvinde")), rang = col_factor(levels = c("officer",
"menig"))))
Note that \(dim(V_G),\,G\in\mathbb{G}_{+1}\equiv\mathbb{G}\cup\mathopen{}\left\{{1}\right\}\mathclose{},\) will be as computed below:
sum((moral$score)^2)
## [1] 288373.4
dL1 <- 1
dLK <- 2
dLR <- 2
dLKxR <- 4
dLI <- 225
dV1 <- dL1
(dVR <- dLR-dV1)
## [1] 1
(dVK <- dLK-dV1)
## [1] 1
(dVKxR <- dLKxR - dVK - dVR - dV1)
## [1] 1
(dLKpR <- dVK+dVR+dV1)
## [1] 3
P1 <- 251710
PK <- 252453
PR <- 254250
PKxR <- 255038
PI <- 288373 #=||X||^2
Q1 <- P1
(QK <- PK-Q1)
## [1] 743
(QR <- PR-Q1)
## [1] 2540
(QKxR <- PKxR-QK-QR-Q1)
## [1] 45
(QI <- PI-QKxR-QK-QR-Q1)
## [1] 33335
(PKpR <- QK + QR + Q1)
## [1] 254993
(QI-PKxR)
## [1] -221703
(taeller <- (PKxR-PKpR)/(dLKxR-dLKpR))
## [1] 45
(naevner <- (PI-PKxR)/(dLI-dLKxR))
## [1] 150.8371
(Feren <- taeller/naevner)
## [1] 0.2983351
pf(Feren, df1 = dLKxR-dLKpR, df2 = dLI-dLKxR, lower.tail = F)
## [1] 0.585479
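The hand-computed F-test can be cross-checked numerically; a Python/scipy sketch using the same numerator, denominator and degrees of freedom as above:

```python
from scipy import stats

num = (255038 - 254993) / (4 - 3)    # (P_KxR - P_K+R) / dimension difference
den = (288373 - 255038) / (225 - 4)  # (P_I - P_KxR) / residual dimension
F = num / den
p = stats.f.sf(F, 1, 221)
print(F, p)  # close to 0.2983 and 0.5855
```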
Alternatively, via anova in R (note that at this point m0 still refers to the earlier restaurant model with response X, which is what triggers the warning below; the m0 on the moral data is first fitted further down):
m1 <- lm(score~koen+rang, data=moral)
anova(m1,m0)
## Warning in anova.lmlist(object, ...): models with response '"X"' removed because
## response differs from model 1
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## koen 1 743 743.09 4.942 0.02722 *
## rang 1 2539 2539.33 16.888 5.58e-05 ***
## Residuals 222 33381 150.36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table(moral$koen, moral$rang)
##
## officer menig
## mand 60 120
## kvinde 15 30
m0 <- lm(score~koen*rang, data=moral)
summary(m0)
##
## Call:
## lm(formula = score ~ koen * rang, data = moral)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.423 -8.050 -0.861 8.267 32.709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.973 1.586 23.318 < 2e-16 ***
## koenkvinde 6.127 3.545 1.728 0.085342 .
## rangmenig -6.651 1.942 -3.425 0.000732 ***
## koenkvinde:rangmenig -2.376 4.342 -0.547 0.584791
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.28 on 221 degrees of freedom
## Multiple R-squared: 0.09076, Adjusted R-squared: 0.07842
## F-statistic: 7.353 on 3 and 221 DF, p-value: 0.0001016
We may draw out the estimates (note that the cell mean for (kvinde, menig) must include both main effects as well as the interaction term):
(cof <- m0$coefficients)
## (Intercept) koenkvinde rangmenig
## 36.972667 6.127333 -6.651250
## koenkvinde:rangmenig
## -2.376083
est <- c(cof[1], cof[1]+cof[2], cof[1]+cof[3], cof[1]+cof[2]+cof[3]+cof[4])
names(est) <- c("mand, officer", "kvinde, officer", "mand, menig", "kvinde, menig")
est
## mand, officer kvinde, officer mand, menig kvinde, menig
## 36.97267 43.10000 30.32142 34.07267
We may also do a std. res + QQ-plot
StdresQQPlot(m0)
Looking at the QQ-plot, there seem to be no significant anomalies to speak of. In regards to homogeneity of the variance, this also seems rather plausible when looking at the Std. Res. plot. There might seem to be slightly less variation in the performance of female officers than in the other groupings, though it is not far off, and could also be attributed to the low number of female officers counted: 15.
By 3.3, we get a \(p\)-value of 0.585479 for the \(F\)-test of the hypothesis, so we cannot reject the hypothesis that there are no interaction effects present.
Note from the summary of the no-interaction model m1 that \(p\)-values are reported for each coefficient, which we may compare to an \(\alpha=0.05\) significance level to judge whether any term might be removed:
m1 <- lm(score~koen+rang, data=moral)
summary(m1)
##
## Call:
## lm(formula = score ~ koen + rang, data = moral)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.056 -7.849 -0.813 8.421 32.867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.289 1.474 25.303 < 2e-16 ***
## koenkvinde 4.543 2.044 2.223 0.0272 *
## rangmenig -7.126 1.734 -4.109 5.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.26 on 222 degrees of freedom
## Multiple R-squared: 0.08953, Adjusted R-squared: 0.08133
## F-statistic: 10.91 on 2 and 222 DF, p-value: 3.01e-05
We see that the hypothesis of removing \(K\) is rejected with Pr(>|t|) = 0.0272205 < 0.05. Had we instead restricted to an \(\alpha=0.01\) level, \(K\) would not have been significant at that level, but at the level \(\alpha=0.05\) both koen and rang are significant.
Alternatively, we might look at
confint(m1, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 34.3851622 40.193793
## koenkvinde 0.5157167 8.570839
## rangmenig -10.5439656 -3.708968
from which we see that \(0\) is not included in either interval, so both \(K\) and \(R\) are significant at the \(\alpha=0.05\) level.
Under the hypothesis of no interaction effects between koen and rang, we have three parameters: the intercept (with reference group mand, officer) and two contrasts, one for koen and one for rang.
overv <- cbind(coef(m1),confint(m1, level=0.95))
colnames(overv)[1] <- "Estimate"
overv
## Estimate 2.5 % 97.5 %
## (Intercept) 37.289478 34.3851622 40.193793
## koenkvinde 4.543278 0.5157167 8.570839
## rangmenig -7.126467 -10.5439656 -3.708968
The model tells us that the estimated score for a male officer (the reference population occupying (Intercept)) is 37.2894778. Switching from this reference to a female officer yields an estimated improvement of 4.5432778, totalling 41.8327556, while switching from the reference to a menig male gives a change (a non-improvement) of −7.1264670, totalling 30.1630111. Finally, the estimate for a menig female combines both offsets, 4.5432778 − 7.1264670 = −2.5831889 relative to the reference group (remember that we assume no interaction effects), totalling 34.7062889.
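These four estimates amount to simple arithmetic on the coefficients of m1; a Python sketch (coefficient values copied from the R summary above):

```python
intercept = 37.289478   # mand, officer (reference group)
koenkvinde = 4.543278   # contrast for koen
rangmenig = -7.126467   # contrast for rang
est = {
    "mand, officer": intercept,
    "kvinde, officer": intercept + koenkvinde,
    "mand, menig": intercept + rangmenig,
    # no interaction term in the additive model, so the offsets just add
    "kvinde, menig": intercept + koenkvinde + rangmenig,
}
print(est)
```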
As in 3.6, we may conclude that there is an \(\alpha=0.05\) significant difference between the performance of males and females (with estimate for the difference being 4.5432778).
confint(m1, level=1-0.0272)
## 1.36 % 98.64 %
## (Intercept) 3.401283e+01 40.566123
## koenkvinde -6.110427e-04 9.087167
## rangmenig -1.098208e+01 -3.270849
confint(m1, level=1-0.02)
## 1 % 99 %
## (Intercept) 33.8361123 40.742843
## koenkvinde -0.2456785 9.332234
## rangmenig -11.1900308 -3.062903
confint(m1, level=1-0.0273)
## 1.36 % 98.64 %
## (Intercept) 34.014984266 40.563971
## koenkvinde 0.002372294 9.084183
## rangmenig -10.979552739 -3.273381